KEYNOTE ADDRESS


TEST QUESTIONS


E.J. McCluskey


CENTER FOR RELIABLE COMPUTING
Computer Systems Laboratory
Stanford University, Stanford, California 94305


The program for FTCS-1 (1971) had 6 paper sessions and one panel
session. The panel session was on diagnosis and testing. Two of the
paper sessions involved testing: "Test Generation and Diagnosis" and
"Fault-Location and Testing." Thus, over one third of the first
symposium was devoted to testing issues.

There are 15 paper sessions, two panel sessions, and one keynote
session at this symposium. Three of the paper sessions, "Design for
Testability," "Test Generation," and "Self-Test," are clearly devoted
to testing topics. Another session, "On-Line Monitoring," is closely
related, and one-half of the papers in the session on "VLSI Design
Issues" relate to testing. Somewhat less than 30% of this symposium is
thus test-related. The attention given to testing hasn't changed very
much from the first to the current FTCS Symposium.

Many conferences devoted entirely to testing have started since 1971:
the Cherry Hill Test Conference and Autotestcon are probably the most
important of these. The conferences on testing typically cover very
practical topics. They are organized and attended mainly by industry
and government people. An exception is the annual Design for
Testability Workshop, sponsored by the IEEE Test Technology Committee,
which has a well balanced participation from academia as well as
industry and government. In addition, testing papers have become
common in many other conferences, most notably the Design Automation
Conference.

Clearly the FTCS activity has not provided a sufficient vehicle to
satisfy all of the current interest in testing. This is particularly
evident from the fact that the IEEE Computer Society has started
another Technical Committee, Test Technology, whose only topic is
testing. Also, another Technical Committee, the Computer Elements
Committee, has now started an Annual Workshop on Testing. For someone
like myself who has a major interest in testing, it has become
necessary to keep up with the activities of three technical committees
as well as more than three annual conferences.


In 1971 the FTCS test papers were concentrated on the question of how
to generate (minimum-length) test sets, and several of them presented
sequential circuit test generation ideas. The emphasis has shifted
significantly, as evidenced by the present conference having sessions
on "Design for Testability," "Self-Test," and "On-Line Monitoring,"
with only one session on "Test Generation." None of the papers appears
to be specifically on sequential circuits, although several address
microprocessor testing.

In the 11 years between the first and the current conference, the
complexity of digital logic has grown exponentially. Computer circuits
have become ubiquitous in western society. The increased complexity
has led to the realization that cost-effective automatic test pattern
generation has become impossible for large designs that do not provide
explicit testability-enhancing features. As a result, there is a great
deal of interest in developing "Design for Testability" techniques.


In spite of much research, sequential circuit test generation is still
extremely expensive. Adding scan path facilities to a design means
that only combinational circuit test generation has to be done. This
technique is fast becoming standard in industrial and government
designs. As complexity continues to increase, it is becoming evident
that the cost of generating combinational circuit tests and applying
them with a tester is itself becoming too high. This has produced a
great interest in the design of "Self-testing" circuits.


Although it is not illustrated by the program of this conference,
another area of current concern is the question of the fault coverage
obtained by the test technique used. With much denser chips two
phenomena come into play: yield is lower, and the chance of faults
that are not adequately modeled as single stuck faults increases.
These produce a requirement for higher fault coverage than was
necessary in the past. The pervasiveness of digital technology has
increased the need for some form of fault tolerance. In the test area
this has caused increased attention to "On-Line Monitoring" as well as
increased test quality.
MODIFIED BERGER CODES FOR DETECTION OF UNIDIRECTIONAL ERRORS


Hao Dong


CENTER FOR RELIABLE COMPUTING, COMPUTER SYSTEMS LABORATORY
Departments of Electrical Engineering and Computer Science
Stanford University, Stanford, California 94305, U.S.A.


ABSTRACT 

Modified Berger codes are defined in this paper. They are less
expensive than the ordinary Berger codes in terms of the number of
check bits and the cost of checkers. As a trade-off, their error
detection ability is slightly lower, although these codes can detect
most unidirectional errors.


INTRODUCTION 

It is seen that some physical defects in LSI or VLSI circuits tend to
generate unidirectional errors. There are several classes of codes,
such as m-out-of-n codes, Berger codes, and two-rail codes, that can
be used to detect unidirectional errors. It has also been proved that
Low-Cost AN Codes and Inverse Residue Codes with group length m+1 can
detect all unidirectional errors of weight less than or equal to m
[Wakerly 75], [Wakerly 78].

Berger codes have been proved to be the optimal separable codes that
detect any unidirectional error [Berger 61] [Freiman 62]. However, in
[Freiman 62], the author pointed out that one could not make a Berger
code detect unidirectional errors of weight less than or equal to m by
simply cutting down the number of check bits, where m is an integer
less than the number of information bits in a codeword. In this paper,
we will define Modified Berger codes (MB codes) so that these codes
will detect all the unidirectional errors of weight less than or equal
to m. Then we will estimate the actual error detection ability of
these codes. Totally self-checking checkers for MB codes are also
described.


DEFINITION 

A codeword of a Berger code has two parts: information D and check
symbol C. Suppose D has I bits and C has k bits. Let I1 and I0 be the
number of 1's and the number of 0's, respectively, in the I
information bits. The check symbol of a codeword is the binary number
I0, or the complement (bit by bit) of the binary number I1. That is,

\[ C = I_0 \quad \text{or} \quad C = (2^k - 1) - I_1 . \]

We have

\[ k = \lceil \log_2 (I + 1) \rceil , \]

where $\lceil a \rceil$ is the least integer greater than or equal to
a. If $I = 2^k - 1$, then using I0 or I1 will result in the same code,
and the code is called the Maximal Length Berger Code [Ashjaee 77].
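
As an illustration of the definition above, the following minimal
sketch computes a Berger check symbol from a string of information
bits. It is not taken from the paper: the function name
`berger_check_symbol` and the bit-string representation are
assumptions made for this example, and the check symbol is taken to be
the binary count of 0's (the "C = I0" form).

```python
def berger_check_symbol(info_bits):
    """Return the Berger check symbol for a string of '0'/'1' information bits.

    The check symbol is the count of 0's (I0), written in
    k = ceil(log2(I + 1)) bits.
    """
    num_info = len(info_bits)            # I
    k = num_info.bit_length()            # equals ceil(log2(I + 1)) for I >= 1
    zeros = info_bits.count("0")         # I0
    return format(zeros, f"0{k}b")


# Eight information bits need k = 4 check bits; here I0 = 5.
print(berger_check_symbol("00000111"))
```

For I = 7 (a maximal length code, since 7 = 2^3 - 1) the count of 0's
and the bit-by-bit complement of the count of 1's coincide, as the
text states.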

Assume now that all the erroneous bits are within the I information
bits. If we use I0 modulo (m+1) or I1 modulo (m+1) (1 < m < I) as the
check symbol, denoted by C1, then all the unidirectional errors of
weight less than or equal to m will be detected by this code, because
no such error could change one codeword to another: an error of weight
w changes I0 (and I1) by exactly w, and for 0 < w <= m this always
changes the value modulo (m+1). In this case
$J = \lceil \log_2 (m+1) \rceil$ bits are needed for the check symbol
C1. Let Pk (k = 0, 1, ...) be the subset of codewords in which every
codeword has I1 = k. The column C1 of Table 1 shows an example of such
a code.


Table 1. A coding example for I=8 and m=7

Subset | Codeword example | I0 | C1 = I0 mod 8 | C2
-------+------------------+----+---------------+-----
P0     | 00000000         | 8  | 000           | 111
P1     | 00000001         | 7  | 111           | 000
P2     | 00000011         | 6  | 110           | 001
P3     | 00000111         | 5  | 101           | 010
P4     | 00001111         | 4  | 100           | 011
P5     | 00011111         | 3  | 011           | 100
P6     | 00111111         | 2  | 010           | 101
P7     | 01111111         | 1  | 001           | 110
P8     | 11111111         | 0  | 000           | 111

A problem arises from the fact that the errors may also change the
check bits. For example, an error may change a codeword in P1 to a
codeword in P0 with only 4 erroneous bits (including the three check
bits). As the number J usually is very small
($J = \lceil \log_2 (m+1) \rceil$), it is reasonable to use a second
level code to detect any error in the check bits. We may use any of
the codes mentioned in the beginning of this paper to encode the check
symbol C1 with another check symbol C2. These codes with check symbols
C1 and C2 are called Modified Berger codes in this paper, and the
maximum weight up to which all unidirectional errors are detected by
an MB code is denoted by m.

Table 1 shows an example of MB codes with m=7. The check symbols C1
and C2 in Table 1 form a two-rail code. In MB codes, because any
unidirectional error in the check bits is detected by the second
coding, either I0 modulo (m+1) or I1 modulo (m+1) can be used directly
as the check symbol C1. It is clear that the MB code in Table 1 (with
check symbols C1 and C2) can detect any unidirectional error of weight
less than or equal to 7, and that this error detection ability is
effective regardless of the number of information bits in the code.
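
The construction can be made concrete with a short sketch that
regenerates the check symbols of Table 1. It assumes, as in that
table, that C1 is I0 modulo (m+1) and that C2 is the bit-by-bit
complement of C1 (a two-rail second-level code). The function name
`mb_check_symbols` and the bit-string representation are illustrative
choices, not the paper's.

```python
def mb_check_symbols(info_bits, m):
    """Return (C1, C2) as bit strings for an MB code with two-rail check symbols.

    C1 is I0 modulo (m + 1) in J = ceil(log2(m + 1)) bits;
    C2 is the bit-by-bit complement of C1.
    """
    j_bits = m.bit_length()                      # J = ceil(log2(m + 1)) for m >= 1
    c1_value = info_bits.count("0") % (m + 1)    # I0 mod (m + 1)
    c1 = format(c1_value, f"0{j_bits}b")
    c2 = "".join("1" if bit == "0" else "0" for bit in c1)
    return c1, c2


# One representative codeword per subset P0..P8, as in Table 1 (I = 8, m = 7).
for k in range(9):
    info = "0" * (8 - k) + "1" * k
    print(f"P{k}", info, *mb_check_symbols(info, 7))
```

Note that P0 and P8 receive the same check symbols: they differ by a
unidirectional error of weight 8 = m+1, which is exactly the kind of
error the code is not designed to detect.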


ERROR  COVERAGE 

From the definition of MB codes, we see that MB codes actually detect
all unidirectional errors except those that affect only the
information bits and have weight equal to a multiple of (m+1). To get
an idea of the effectiveness of MB codes, we estimate their error
coverage under two different error models.
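
The characterization of the undetected errors can be checked by brute
force on a small code. The sketch below is an illustration added here,
not part of the original analysis. It assumes the two-rail
construction (C1 = I0 modulo (m+1), C2 = complement of C1), enumerates
every unidirectional error on every codeword, and collects the errors
that turn a codeword into another codeword. All names are
hypothetical.

```python
from itertools import combinations, product


def encode(info, m):
    """Append two-rail check symbols C1 = I0 mod (m+1) and C2 = complement of C1."""
    j_bits = m.bit_length()
    c1_value = info.count(0) % (m + 1)
    c1 = tuple((c1_value >> s) & 1 for s in reversed(range(j_bits)))
    c2 = tuple(1 - b for b in c1)
    return tuple(info) + c1 + c2


def undetected_errors(num_info, m):
    """Yield the flipped positions of every unidirectional error that
    maps a codeword onto another codeword (and so goes undetected)."""
    for info in product((0, 1), repeat=num_info):
        word = encode(info, m)
        for old in (0, 1):                      # old=0: 0->1 errors, old=1: 1->0 errors
            sites = [p for p, b in enumerate(word) if b == old]
            for weight in range(1, len(sites) + 1):
                for flipped in combinations(sites, weight):
                    bad = list(word)
                    for p in flipped:
                        bad[p] = 1 - old
                    if encode(tuple(bad[:num_info]), m) == tuple(bad):
                        yield flipped


errors = list(undetected_errors(6, 3))          # I = 6, m + 1 = 4
print(len(errors))
print(all(max(e) < 6 and len(e) % 4 == 0 for e in errors))
```

For this small case every undetected error lies entirely within the
six information bits and has weight 4, in agreement with the statement
above.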

Assume that the check symbol C1 is encoded in a two-rail code by the
check symbol C2. In this section we use the following notation:

m:  the maximum weight of unidirectional errors detected by the MB
    code;

J:  the number of bits in check symbol C1 or C2,
    $J = \lceil \log_2 (m+1) \rceil$;

I1: the number of 1's in the information bits;

I0: the number of 0's in the information bits;

I:  the number of information bits, I = I1 + I0;

n:  the length of a codeword, n = I + 2J.

The total number of 1's (0's) in a codeword is I1+J (I0+J).


First, let us consider the independent error model. Under this model,
an error on an output line is independent of the status of the other
outputs. Suppose the probability that an error occurs on one output
bit is p, that this probability is uniform for every output bit, and
that q = 1-p. Then the probability that a unidirectional error occurs
is

\[
\begin{aligned}
\text{Prob(any unidirectional error)}
&= \sum_{i=1}^{I_1+J} \binom{I_1+J}{i} p^i q^{n-i}
 + \sum_{i=1}^{I_0+J} \binom{I_0+J}{i} p^i q^{n-i} \\
&= (1 - q^{I_1+J})\, q^{I_0+J} + (1 - q^{I_0+J})\, q^{I_1+J} \\
&= q^{I_0+J} + q^{I_1+J} - 2 q^{n} \\
&\approx np \qquad (p \ll 1).
\end{aligned}
\]

The probability that an undetected unidirectional error occurs is

\[
\begin{aligned}
\text{Prob(undetected unidirectional error)}
&= \binom{I_1}{m+1} p^{m+1} q^{n-(m+1)}
 + \binom{I_1}{2(m+1)} p^{2(m+1)} q^{n-2(m+1)} + \cdots \\
&\quad + \binom{I_0}{m+1} p^{m+1} q^{n-(m+1)}
 + \binom{I_0}{2(m+1)} p^{2(m+1)} q^{n-2(m+1)} + \cdots \\
&\approx \left[ \binom{I_1}{m+1} + \binom{I_0}{m+1} \right] p^{m+1}
 \qquad (p \ll 1).
\end{aligned}
\]

Then the conditional probability that a unidirectional error occurs
but is not detected is

\[
\frac{\text{Prob(undetected unidirectional error)}}
     {\text{Prob(any unidirectional error)}}
\approx \left[ \binom{I_1}{m+1} + \binom{I_0}{m+1} \right] \frac{p^{m}}{n} .
\]

This number will change from codeword to codeword, but essentially it
should be very small for reasonable values of p and a value of m which
is greater than 1. Also, this probability will decrease exponentially
when m increases. The reason for this is that the independent error
model implies a lower probability for multiple errors. Although in
some cases, such as a combinational circuit with fan-out points, the
independent error model does not apply very well, in general it is
usually true that an error is less likely to occur if it involves more
bits.

Next we consider another error model. Now we assume that all the
unidirectional errors, no matter how many bits they affect, have the
same probability of occurring. Also assume that all the codewords are
equally likely to be the output. The number of error patterns in a
codeword is

\[
\sum_{i=1}^{I_1+J} \binom{I_1+J}{i} + \sum_{i=1}^{I_0+J} \binom{I_0+J}{i}
= (2^{I_1+J} - 1) + (2^{I_0+J} - 1).
\]

The number of codewords for given I1 and I0 is

\[ \binom{I}{I_1} . \]

Let the total number of error patterns be E. We have

\[
E = \sum_{I_1=0}^{I} \left[ (2^{I_1+J} - 1) + (2^{I_0+J} - 1) \right] \binom{I}{I_1}
  = 2 \sum_{I_1=0}^{I} (2^{I_1+J} - 1) \binom{I}{I_1} .
\]

The number of undetected errors for each codeword is

\[
\sum_{0 < j(m+1) \le I_1} \binom{I_1}{j(m+1)}
+ \sum_{0 < j(m+1) \le I_0} \binom{I_0}{j(m+1)} .
\]

Let the total number of undetected errors be D. Then

\[
D = \sum_{I_1=0}^{I} \left[ \sum_{0 < j(m+1) \le I_1} \binom{I_1}{j(m+1)}
  + \sum_{0 < j(m+1) \le I_0} \binom{I_0}{j(m+1)} \right] \binom{I}{I_1}
  = 2 \sum_{I_1=0}^{I} \; \sum_{0 < j(m+1) \le I_1} \binom{I_1}{j(m+1)} \binom{I}{I_1} .
\]
The error coverage of an MB code for all the unidirectional errors is
1-D/E. Table 2 shows this coverage for some MB codes in which C1 and
C2 form a two-rail code. It is seen that when the number of
information bits grows, the error coverage tends to stay around a
fixed figure. This is a big advantage for those applications where the
circuits have a large number of outputs. For a practical circuit, the
error patterns that might occur will depend on the function and the
structure of that circuit. In general, the error coverage of an MB
code should be somewhere between the two models we analysed. So we can
say that MB codes will detect most of the unidirectional errors that
may occur in a circuit.
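
To get a feel for the independent error model, the following sketch
evaluates the conditional probability of an undetected unidirectional
error for one codeword and compares it with the small-p approximation
[C(I1, m+1) + C(I0, m+1)] p^m / n. It assumes two-rail check symbols
(so a codeword has I1+J ones and I0+J zeros). The function names and
the sample values I1 = I0 = 8, m = 3, p = 0.001 are arbitrary
illustrative choices, not figures from the paper.

```python
from math import comb


def undetected_ratio(i1, i0, m, p):
    """Prob(undetected unidirectional error) / Prob(any unidirectional error)
    under the independent error model, evaluated without approximation."""
    j_bits = m.bit_length()                  # J = ceil(log2(m + 1))
    n = i1 + i0 + 2 * j_bits                 # codeword length
    q = 1.0 - p
    any_error = q ** (i0 + j_bits) + q ** (i1 + j_bits) - 2 * q ** n
    undetected = 0.0
    for count in (i1, i0):                   # 1->0 errors, then 0->1 errors
        for w in range(m + 1, count + 1, m + 1):
            undetected += comb(count, w) * p ** w * q ** (n - w)
    return undetected / any_error


def approx_ratio(i1, i0, m, p):
    """Leading-order estimate of the same ratio for p << 1."""
    n = i1 + i0 + 2 * m.bit_length()
    return (comb(i1, m + 1) + comb(i0, m + 1)) * p ** m / n


print(undetected_ratio(8, 8, 3, 1e-3))
print(approx_ratio(8, 8, 3, 1e-3))
```

For small p the two values agree closely, and both shrink rapidly as m
grows because of the p^m factor.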

Table 2. Comparison: Error coverage

          MB Codes            |  Berger Codes
 I   | m+1 | 2J | Coverage*   |  k  | Coverage
-----+-----+----+-------------+-----+---------
 16  |  4  | 4  | 93.74%      |  5  | 100%
 32  |  4  | 4  | 93.75%      |  6  | 100%
 48  |  4  | 4  | 93.75%      |  6  | 100%
 64  |  4  | 4  | 93.75%      |  7  | 100%
 16  |  8  | 6  | 99.04%      |  5  | 100%
 32  |  8  | 6  | 98.54%      |  6  | 100%
 48  |  8  | 6  | 98.33%      |  6  | 100%
 64  |  8  | 6  | 98.47%      |  7  | 100%

* Only for unidirectional errors.
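
The coverage figure 1-D/E under the equal-likelihood model can be
evaluated directly from the sums for E and D. The sketch below is a
straightforward transcription of those sums, assuming two-rail check
symbols. The function name `mb_coverage` is a hypothetical choice, and
exact integer arithmetic is used until the final division.

```python
from math import comb


def mb_coverage(num_info, m):
    """Return 1 - D/E for an MB code with num_info information bits."""
    j_bits = m.bit_length()                  # J = ceil(log2(m + 1))
    total = 0                                # E: all unidirectional error patterns
    undetected = 0                           # D: patterns that go undetected
    for i1 in range(num_info + 1):
        i0 = num_info - i1
        words = comb(num_info, i1)           # codewords with this I1
        total += ((2 ** (i1 + j_bits) - 1) + (2 ** (i0 + j_bits) - 1)) * words
        missed = 0
        for count in (i1, i0):
            missed += sum(comb(count, w) for w in range(m + 1, count + 1, m + 1))
        undetected += missed * words
    return 1 - undetected / total


for num_info, m in [(16, 3), (16, 7), (32, 7)]:
    print(num_info, m + 1, f"{100 * mb_coverage(num_info, m):.2f}%")
```

Evaluated this way, the entries for I = 16 with m+1 = 4 and m+1 = 8,
and for I = 32 with m+1 = 8, agree with Table 2 to two decimal places.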

CHECKING CIRCUITS


A general design procedure for Totally Self-Checking (TSC) checkers of
Berger codes was presented in [Marouf 78]. The structure of these
checkers is shown in Fig. 1. In the diagram, circuit N1 is a weight
generator which generates the weight of the information part D, that
is, I1. Then the outputs of N1 are compared with the check symbol C by
the comparator N2, which is implemented as a two-rail code checker.
The weight generator N1 is a network of full adder (FA) and half adder
(HA) modules. The procedures for constructing different weight
generators are given in [Marouf 78]. This Berger code checker design
can be easily modified for MB code checkers.


[Figure 1 diagram not reproduced. Inputs: Information D, Check Symbol
C. Output: Error Signal.]

Figure 1. Structure of Berger code checkers

A TSC checker for an MB code consists of two parts as shown in Fig. 2.
Circuit CH1 checks the information bits by generating the complement
of the check symbol C1 (by circuit N1') and comparing it with C1 (by
circuit N2'). Assume that check symbol C1 is defined as

\[ C_1 = (2^J - 1) - \bigl( I_1 \bmod (m+1) \bigr) . \]

In other words, C1 is the complement (bit by bit) of I1 modulo (m+1).
In this case, circuit N1' generates the weight of information part D
modulo (m+1). We call such a circuit N1' a modulo weight generator,
while the ordinary weight generators are referred to as full weight
generators. Circuit N2' in Fig. 2 is a two-rail code checker. The J
outputs of circuit N1', denoted by C1', will then be compared with the
check symbol C1 of the codeword by the checker N2'. Because MB codes
provide the full code space for the two-rail code checker N2', the
checking circuit CH1 described above is a TSC checker. The second
level coding of C1 and C2 is checked by circuit CH2. C1 and C2 may
form either a two-rail code or another Berger code. If C1 and C2 form
a two-rail code, then CH2 is the same as N2'. When no error occurs, C1
and C2 are a codeword, and so are C1 and C1'. We have C1' = C2, and
f = f', g = g'.


[Figure 2 diagram not reproduced. It shows checker CH1 (circuits N1'
and N2') and checker CH2, with inputs D, C1 and C2 and outputs f, g,
f' and g'.]

Figure 2. Structure of MB code checkers


Conceptually m+1 may be any integer, but in the case that
$m+1 = 2^J$ the circuit implementation of the checker will be the
simplest. In fact, a modulo weight generator can be obtained directly
from the corresponding Berger code checker. This is done by keeping
only the lowest J bits in each stage of weight representation and
removing all the higher bits in the full weight generator. Fig. 3
shows how a full weight generator with 15 information bits can be
modified to realize a modulo weight generator with m+1=4 (modulo 4).
In Fig. 3, numbers for the full weight generator are noted in () if
these numbers are different from those of the modulo weight generator.
The asterisk (*) in an adder module indicates that for the modulo
weight generator the adder does not have the highest carry output. So,
for MB code checkers, the Jth bit of these adder modules is simply a
three-input XOR gate instead of a full adder. It is also seen from
Fig. 3 that the modulo weight generator has less delay time than the
full weight generator since its last adder module is one bit shorter.
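
The idea of discarding the higher bits at every stage can be sketched
behaviorally. The code below is not the FA/HA network of Fig. 3: it is
a simplified pairwise adder tree, assumed here only for illustration,
in which each partial sum is truncated to its lowest J bits (that is,
the carry out of bit J is dropped). The function name `modulo_weight`
is hypothetical.

```python
def modulo_weight(bits, j_bits):
    """Sum a list of 0/1 values with a pairwise adder tree that keeps only
    the lowest j_bits bits of every partial sum (weight modulo 2**j_bits)."""
    mask = (1 << j_bits) - 1
    level = list(bits)
    while len(level) > 1:
        # Add neighbours pairwise; masking models the missing highest carry.
        merged = [(level[k] + level[k + 1]) & mask
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:                   # odd element passes through unchanged
            merged.append(level[-1])
        level = merged
    return level[0] if level else 0


info = [1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # 15 bits, weight 9
print(modulo_weight(info, 2), sum(info) % 4)
```

Because truncation commutes with addition modulo 2^J, dropping the
high-order carries at each stage never changes the final result modulo
2^J, which is why the narrower adders suffice.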


[Figure 3 diagram not reproduced. Legend: each box is an i-bit adder
with a separate carry input for the lowest bit.]


Figure 3. Modulo weight generator

Table 3 compares the hardware cost of check symbol generators of MB
codes (m+1=4) with Berger codes and low-cost residue codes (group
length=4). TSC checkers for low-cost residue codes are described in
[Ashjaee 77] and [Avizienis 71]. The hardware savings in Table 3 are
estimated in terms of the number of devices for PLA implementations
and are compared with Berger codes.


Table 3. Comparison: Hardware cost for check symbol generators

 Info   | Berger codes | LC codes |          MB codes
 bits I |  FA  |  HA   |    FA    |  FA  | HA | XOR* | Savings
--------+------+-------+----------+------+----+------+--------
   15   |  11  |  0    |    -     |  7   | 0  |  3   | 22.7%
   16   |  11  |  4    |    12    |  7   | 2  |  3   | 25.6%
   31   |  26  |  0    |    -     |  15  | 0  |  7   | 28.8%
   32   |  26  |  5    |    28    |  15  | 2  |  7   | 30.7%
   63   |  57  |  0    |    -     |  31  | 0  |  15  | 32.5%
   64   |  57  |  6    |    60    |  31  | 2  |  15  | 33.6%

* Three-input XOR gate.


CONCLUSION


Modified Berger codes are defined in this paper. MB codes can detect
all unidirectional errors of weight not equal to a multiple of a
predefined integer m+1. This error coverage is greater than that of
low-cost residue codes but less than that of Berger codes. MB codes
have fewer check bits than the corresponding low-cost codes or Berger
codes in most cases. Also, MB codes can be easily applied to any
number of information bits. Because the number of check bits of an MB
code is independent of the total number of information bits, MB codes
are suitable for circuits that have a large number of outputs, such as
PLA's. It is also shown in the paper that the totally self-checking
checkers for MB codes are less expensive and have less time delay than
those for either Berger codes or low-cost residue codes. All these
advantages make MB codes very attractive for practical applications.
This paper describes an analysis of CPU errors at the Stanford Linear
Accelerator Center Computational Facility. The study includes all
classes of temporary and permanent CPU errors. Nearly 85 percent of
the errors are temporary failures. We find a strong load dependency in
the errors. The observed tendency is present in three years of load
data. This observation is significant because a load-failure
relationship found at the CPU level must, in our view, be considered
fundamental. In addition, the fact that most of the errors are
transients or intermittents provides new information on these error
types with respect to their load dependent behavior. Our analysis
procedure, used on the SLAC data, has been validated on an
artificially created data base seeded with failures.

Keywords: Statistical failure models, workload, data analysis.


INTRODUCTION

It is well known that as a system approaches high levels of
utilization, degradation in performance occurs [Ferrari 78]. An
important question is whether increased system activity also results
in the degradation of system reliability. If this is true, the
implications are quite fundamental, since increased usage would result
in an increased risk of error. Computing systems, which need maximum
reliability at the time of their peak load, would require a
reevaluation of their reliability projections. Research on the
resolution of this question has been in progress at the Center for
Reliable Computing at Stanford University since 1978. A lack of
understanding of the complex physical interactions involved precludes
analytical modeling at this stage. Accordingly, our approach has been
to assume no model a priori, but rather start from a substantial body
of empirical data on system load and failures. The object of the
project is twofold:

1.  To design and implement statistical experiments in an attempt to
    study the dependence of failure on load.

2.  To develop models for determining any cause-effect relationships
    between workload and failures.


The techniques developed will form an important basis upon which
analytical models and simulation techniques can subsequently be
developed.

It is the purpose of this paper to report the results of our most
recent investigations. These investigations were conducted on the IBM
computer system at the Stanford Linear Accelerator Center (SLAC)
computational facility. An overview of the SLAC system configuration
appears in [Butner 80]. Using new techniques to measure both the
workload and hardware errors in a large computer center for a period
of three years, the following were completed:

1.  The present study concentrates on CPU errors. A large majority of
    these can be classified as transient or intermittent.

2.  We have now established a completely new data base of failures and
    load which is considerably superior to our old data base (UNILOC)
    [Butner 80] in depth, range and integrity. In particular, it
    captures a detailed internal view of the system and, unlike
    UNILOC, is automatically collected.

3.  More significantly, the workload and failure data were combined in
    order to match failures with workloads at the times of failure.

4.  The measurements and statistical experiments clearly demonstrate
    an increased risk of CPU errors due to increased values of
    workload variables. Examples are CPU utilization, input/output
    rate, and interrupt rates.

A representative measurement is illustrated in Fig. 1, which shows how
an increase in the input/output rate can result in higher risk of
processor errors. The horizontal axis is the workload variable; the
vertical axis is the risk of error. Modeling details will be given
later in this paper.
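
A plot of this kind can be thought of as an empirical risk estimate
per workload bin. The sketch below is only an illustration of that
idea and is not the modeling procedure of this paper. It assumes
hypothetical input data in the form of (workload value, error
observed) pairs, one per measurement interval, and reports for each
workload bin the fraction of intervals in which an error was recorded.
The function name `risk_by_load` and the sample values are invented
for the example.

```python
from collections import defaultdict


def risk_by_load(samples, bin_width):
    """Map each workload bin (by its lower edge) to the fraction of
    measurement intervals in that bin with at least one recorded error.

    samples: iterable of (load_value, had_error) pairs, one per interval.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for load, had_error in samples:
        edge = (load // bin_width) * bin_width   # lower edge of the bin
        totals[edge] += 1
        errors[edge] += int(had_error)
    return {edge: errors[edge] / totals[edge] for edge in sorted(totals)}


# Invented sample data: (I/O rate, error seen in that interval).
samples = [(5, False), (7, False), (12, True), (15, False), (25, True), (28, True)]
print(risk_by_load(samples, 10))
```

A rising sequence of values across the bins would correspond to the
trend described for Fig. 1.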


Related  Research 

The failure data for initial studies, [Beaudry 78], [Butner 80] and
[Iyer 81], came from the operator-maintained data base called UNILOC.
A statistical analysis of UNILOC failure data was performed in
conjunction with a number of performance measures from the IBM SMF(1)
data log. In particular, we analyzed hardware and software failures,
classified by component types. The study revealed a strong correlation
between load and failures, although software failures correlated at a
somewhat weaker level than hardware. Most importantly, the average
overall system failure rate varied cyclically over a band of
significant width as determined by the daily load variations.

Figure 1: Risk of error increases with increasing I/O rate.


Additional substantiation of this result came from results reported in
[Castillo 80], where a constant failure rate model is proposed. In
experimenting with data from a DEC system, [Castillo 80] found a
Poisson model to be valid only at specific hours of the day, for
particular load levels. Subsequently, the same authors [Castillo 81]
proposed the use of a doubly-stochastic Poisson process to model the
cyclic load-failure relationship. The model assumes that the
instantaneous failure rate can be described by a cyclostationary
Gaussian process. In [Gunther 80] a novel theoretical model for an
apparent dependency of failure on load, based on a random walk
formulation, is described.


The next section motivates the current work and places our previous
results in perspective. "Measurements" discusses the failure and
workload measurements taken and briefly presents the organization of
the data. Subsequent sections describe the analysis procedures and
present new results. Finally, we summarize the important results and
highlight the conclusions that can be drawn from them.

MOTIVATION

A user-oriented real time system is frequently expected to respond to
conditions which differ from those for which it was modeled and
evaluated. As indicated earlier, our approach has been to start with a
substantial body of real data and examine it for a real or apparent
dependency. In view of our previous results, we believe that the error
process which ensues is composed of two separate effects. The first is
the (constant) inherent failure rate. This is determined through
classical reliability techniques [Shooman id], taking into
consideration such factors as topology, redundancy, etc. The second is
the utilization-induced failure rate. This rate is dependent upon both
the absolute level of system utilization and the rate of change of
that level. By an absolute level we mean an obviously measurable
level, e.g., CPU utilization, memory occupancy, etc. Through the rate
of change of utilization we are attempting to measure the rate at
which transitions occur between various system states, e.g., the
transitions of the CPU into and out of the busy state. Although the
exact nature of these effects is not known, some underlying causes are
thought to be as follows:

(i) Latent Discovery Effect: Many failures can only be detected when a
particular module or subsystem is "exercised." In other words, the
system can be modeled as a load flow graph wherein we have increased
path utilization when the load increases. Thus, although the failures
may not be caused by increased utilization, they are "revealed" by
this factor. The time between the occurrence of failure and
manifestation as a system error has been referred to as "error
latency" [Shedletsky 7J].

(ii) Inherent Effect: There appears to exist a correlation between
utilization and reliability. The more often we exercise information
access channels and associated memory locations, the greater the
temperature and increase in fatigue.

(iii) Noise: A higher utilization level results in increased
electronic noise. This can be expected to result in a higher error
probability.

(iv) Synchronization and Timing Anomalies: The synchronization or
timing anomaly category includes the failures due to time dependent
aspects of the software and hardware. An error in the access to
critical regions or an unanticipated sequence of status in an
inter-computer communication protocol are some examples. Dependence
upon level of utilization is obvious: as a system approaches full
capacity, the "relative" timings of events can fluctuate widely.
Sequences of events between processors can change from those
originally anticipated as one or more of the CPUs nears saturation. A
frequent source of timing anomaly is implicit assumptions (often
totally unintentional) regarding absolute times between events. In a
real system pushed near 100 percent utilization, timings are highly
dilated between system components and absolute synchronization
assumptions can be violated.


The IBM System Management Facilities, for collecting
accounting and performance data. See [IBM 73] for details.


Our previous studies did provide us insight into the
above effects. We were, for example, able to see that
though the latent discovery was an important factor,
other effects were indeed present. The strong correlations
in the pre-noon period (the "* o'clock phenomenon")
[Butner 80] show that latent discovery is an important
factor. The continued strong correlation in the afternoon
suggests that other effects are also present.

We were, however, limited by the fact that the
failure data was an external, human-collected view
of the system. In order to obtain a closer insight
into the problem, it was considered necessary to
study the internal error generation process and
determine its relationship, if any, with system
activity. In particular, we decided to concentrate
on CPU errors. An important reason for this was
the fact that little is known regarding the behavior
of the CPU errors and their relationship to
load. In addition, a substantial number (tS percent)
of the CPU errors, in the period of our
study, were found to be "soft" errors, i.e. those
from which the system recovered. Accordingly, they
come in the general category of transient or intermittent
errors (defined below), or design errors.
Again, relatively little is known regarding the
generation of these soft errors.

We define an intermittent error as one due to a
component on the verge of failure. The error will
re-occur frequently and eventually become permanent.
It is generally believed that temporary
failures are four to five times as frequent as permanent
failures [Ball 69]. Nearly 90 percent of
field errors are believed attributable to this
class. Although a few analytical models exist,
they are extremely restrictive and the basic
assumptions need validation [Savir 77]. Statistical
studies on real data are few and far between
[McConnell 79]. The following section gives a general
overview of the measurement techniques and
construction of the data base.

MEASUREMENTS 

Error Measurement

As stated earlier, the present study uses the most
detailed data from the log maintained by the operating
system as errors are detected by the hardware
and recorded by the software. High level system
behavior, as seen by the computer operator and
users, is not directly measured. Instead, there is
much information on hardware errors, both permanent
and non-permanent (transient and intermittent), as
they occur in the detailed operation of system components.


The SLAC system, during the period of our study,
consisted of two IBM 370/168 mainframes and an IBM
360/91 connected in a triplex mode. The data for
our study, which consisted of three years of measurements
(1979, 1980, and 1981), came from the two
IBM 370/168 mainframes. The log referred to above
is commonly called the "EREP" log, from the Environmental
Recording Editing and Printing program
used to accumulate and format it for maintenance
[IBM 79]. Note that it is significantly more comprehensive
than UNILOG, which is essentially an
external, human-collected log.

Errors in IBM 370 systems are classified into
three major types:

1. CPU Errors - In the central processor and storage.

2. Channel Errors - In I/O channels and associated
interfaces.

3. Outboard Errors - In any device beyond the
channel-control unit interface, i.e. all
errors in I/O devices.

For each error, whether recoverable or not, the
operating system creates a time-stamped record
describing the error and providing relevant information
on the state of the machine. As an example,
for a CPU error, the state information might
include the contents of all internal registers and
diagnostic information collected by the hardware
(such as parity indicators and error flags).

At SLAC this information is collected on a daily
basis and archived for many years. A small sample
is presented in Table 1.

Workload Measurement

Since errors in processors occur fairly infrequently
(on the order of once a day for our measurements),
correlation with workload requires long
term workload figures. Our workload data comes
from two sources: the built-in system utilization
facility, and a software monitor written specifically
for this study. They are discussed below.

SMF Data. The operating systems in the processors
measured use IBM's System Management Facilities
(SMF) for usage accounting. SMF was originally
designed to provide accounting information,
but it has evolved over the years to include more
general performance measurement information. SMF
is discussed exhaustively elsewhere [IBM 73],
[Butner 80] and will not be detailed here.
In general, SMF data consists of records giving
resource utilization figures for jobs, files, I/O
devices, and a potpourri of statistics gathered and
written on a periodic basis. For this work we use
the type 4 (Step) record, which holds statistics
for each job step as it completes execution, and
the type 1 (Wait) record, written roughly every 10
minutes, which summarizes global system utilization
during that 10 minute period. With careful processing
SMF can provide excellent workload statistics,
especially when high resolution results are
not needed.

INTRACK Monitor. To obtain more detailed information
about transient behavior in the CPU we
implemented an interrupt rate monitor, called
INTRACK. This software monitor consists of two
components: the interrupt counters and the INTRACK
recorder. There are four classes of interrupts in
the IBM 370 architecture:¹

1. External (EXT) - Used by the operating system
for clocks and inter-CPU communication.

2. Supervisor Call (SVC) - Caused by any SVC
instruction. Used for operating system services
such as memory allocation, synchronization,
I/O, timing, etc.

3. Program (PROG) - Program traps due to arithmetic
conditions (e.g. division by zero),
invalid operations, or page faults.

4. Input/Output (I/O) - From completion of I/O
operations.

The operating system provides an interrupt handler
for each class of interrupt. A counter field
and an instruction to increment the counter were added
at the beginning of each interrupt handler. These
counters start at zero when the system is loaded
and increase monotonically until the system crashes
or is reloaded. The counters have the capacity to
count up to 10", so overflow is not a problem.

The INTRACK recorder is a continuously running
program that is automatically started every time
the operating system is loaded. Table 2 summarizes
the sources of data for our workload information.
Figure 2 is an example of interrupt rates derived
from the INTRACK counters.
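
The step from cumulative counters to interrupt rates can be sketched as follows. This is a minimal illustration in Python and not the INTRACK code itself: the function name `interrupt_rates`, the (time, count) sample layout, and the way a counter restart is handled are all assumptions made for the example.

```python
def interrupt_rates(samples):
    """Turn cumulative counter snapshots into rates (interrupts per second).

    samples: list of (time_in_seconds, counter_value) pairs in time order.
    The counters only grow while the system is up, so a drop is taken here
    (as an assumption) to mean the system was reloaded and the counter
    restarted from zero.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = c1 - c0
        if delta < 0:          # counter restarted after a reload
            delta = c1
        rates.append(delta / (t1 - t0))
    return rates


# Hypothetical snapshots taken every 300 seconds; the last one follows a reload.
snapshots = [(0, 0), (300, 1500), (600, 4500), (900, 300)]
rates = interrupt_rates(snapshots)
```

Differencing successive snapshots is what makes a free-running counter usable as a rate measure; the recorder only has to read the counters periodically.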

The  Data  Base 

Before the load and error data could be analyzed,
it was necessary to create a coherent data base
which could be used as input in any subsequent
analysis. This was particularly important for the
workload data since the records came in varying
formats and types. As a first step we created
5-minute time averages for all workload parameters
for the entire period of our study.


1 Machine check interrupts are not considered here
because they are already collected in the EREP
data.

Figure 2: One day of INTRACK-collected interrupt rates.

In order to determine the load at the time of
failure, the 5-minute load averages (which we refer
to as smeared averages) were merged with the EREP
log. The load at failure was taken to be the load
in a five minute interval prior to the failure to
eliminate perturbations from system error recovery
or a system crash. The matching is shown in Figure 3.
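
The matching of a failure to the load in the preceding interval can be sketched in Python as below. This is only an illustration under stated assumptions: the function name, the representation of the smeared averages as two parallel lists, and the reading of "prior" as "the interval immediately before the one containing the failure" are ours and are not taken from the software described in the paper.

```python
from bisect import bisect_right


def load_prior_to_failure(interval_starts, loads, failure_time):
    """Return the smeared load of the interval preceding the failure.

    interval_starts: sorted start times (seconds) of the 5-minute intervals.
    loads:           average load of each interval, same order.
    failure_time:    time stamp of the error record.
    Returns None when no earlier interval exists.
    """
    containing = bisect_right(interval_starts, failure_time) - 1
    if containing < 1:
        return None
    return loads[containing - 1]


# Hypothetical data: four 5-minute intervals and a failure at t = 650 s.
starts = [0, 300, 600, 900]
cpu_load = [0.2, 0.5, 0.9, 0.4]
matched = load_prior_to_failure(starts, cpu_load, 650)
```

Using the earlier interval keeps the load figure free of the activity caused by error recovery itself, which is the reason given in the text.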


Figure 3: Merging of Load and Failure Data

As a second step, we also created an hourly smeared
data base. The creation of these data bases necessitated
complex processing in order to minimize the
loss of information which invariably accompanies
such procedures. The software system developed for
this purpose is described in [Rossetti 81]. The
system, which is highly interactive, allows efficient
handling of large amounts of data (on the
order of 4 x 10* bytes) of varying formats and
complexities.

ANALYSIS

Initial Data Analysis

As an example of a simple analysis, let us summarize
the types of machine checks that occur in the
error data. We will count all unique patterns
found in the machine check status bits provided by
the hardware. The SAS¹ program used to generate
Table 3 is less than fifteen lines long and was
very simple to write. Each row in this table represents
a unique pattern which may include one or
more indicators that make up the error type. Positions
containing "--" mean that the corresponding
indicator was not in the pattern; abbreviations
(such as SDMG, IDMG, etc.) are used as mnemonics
for indicators that were set in the pattern. For
example, the second row indicates that of the 456
errors occurring in the three years, 100 were
hardware-recovered storage errors. The figures at the
bottom of each column show the number and percent
of errors for which the corresponding indicator was
set.

TABLE 3

Breakdown of CPU Error Types

SYS    INST   SYS    EXT    DE-     STOR-
DMG    DMG    RCVY   DMG    GRADE   AGE      FREQ    PCT

--     --     --     EDMG   --      --        169   37.0
--     --     RCVY   --     --      STRG      100   21.9
--     IDMG   --     --     --      STRG       99   21.7
--     --     RCVY   --     --      --         46   10.0
--     --     --     EDMG   --      STRG       21    4.6
SDMG   --     --     --     --      --         11    2.4
--     --     RCVY   --     DEGR    --          6    1.3
--     --     RCVY   EDMG   --      STRG        1    0.2
--     IDMG   --     --     --      --          1    0.2
--     IDMG   --     EDMG   --      --          1    0.2
--     IDMG   --     EDMG   --      STRG        1    0.2

11     102    153    193    6       222      Frequency
2%     22%    34%    42%    1%      49%      Pct. of All Errors

A careful examination reveals much information
about the types of processor errors and their relative
severity. For example, external damage (EDMG)
occurred in a large number of patterns (42%), and
the table indicates that in almost all of those
cases no other damage was detected in the CPU.
External damage is an error occurring in an area of
the system not directly connected with processing
the current instruction. Another frequent category
is system recovery (RCVY), at 34%, mostly in conjunction
with some type of storage error (STRG).
Apparently the system was able to recover by using
error correcting codes or by retrying the instruction
in progress. Notice that storage errors were
involved in almost half the errors (222 or 49%)
with about half of those (101 or 22%) being immediately
corrected by the hardware. In fact, other
tabulations show that the remaining storage errors
were dealt with by operating system termination (54
or 12%), and task termination (76 or 17%). System
damage (SDMG), which causes the operating system to
stop immediately after recording the failure,
occurred 2% of the time. The above shows that an
assortment of fault recovery techniques are being
used and contribute markedly to overall system performance.
In fact, we find that in only 14% of the
errors does the operating system stop processing.
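
The column figures quoted above follow directly from the pattern rows of Table 3. The short Python sketch below recomputes them; it is not the SAS program mentioned in the text, and the indicator mnemonics (SDMG, IDMG, RCVY, EDMG, DEGR, STRG) and row values are our transcription of the table.

```python
from collections import Counter

# (indicators set in the pattern, frequency), transcribed from Table 3.
ROWS = [
    ({"EDMG"}, 169),
    ({"RCVY", "STRG"}, 100),
    ({"IDMG", "STRG"}, 99),
    ({"RCVY"}, 46),
    ({"EDMG", "STRG"}, 21),
    ({"SDMG"}, 11),
    ({"RCVY", "DEGR"}, 6),
    ({"RCVY", "EDMG", "STRG"}, 1),
    ({"IDMG"}, 1),
    ({"IDMG", "EDMG"}, 1),
    ({"IDMG", "EDMG", "STRG"}, 1),
]


def column_totals(rows):
    """Count, for each indicator, the errors whose pattern has it set."""
    totals = Counter()
    for indicators, freq in rows:
        for name in indicators:
            totals[name] += freq
    return totals


total_errors = sum(freq for _, freq in ROWS)
totals = column_totals(ROWS)
percent = {name: round(100 * n / total_errors) for name, n in totals.items()}

# Storage errors corrected by the hardware: patterns with both RCVY and STRG.
recovered_storage = sum(f for ind, f in ROWS if {"RCVY", "STRG"} <= ind)
```

Because a pattern may set several indicators, the column totals add up to more than the 456 errors; only the FREQ column partitions the errors.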

Workload and Error Analysis

The data consisted of three years of load/failure
measurements: 1979, 1980 and 1981. The 1981 data
contains additional measurements made by our special
purpose interrupt monitor. Initially, we analyzed
each year separately. Since there was no
significant difference in the 1979 and 1980
results, it was considered appropriate to combine
the corresponding load-failure data. Of the thirteen
workload measures collected for the study,
four were chosen to be studied for 1979 and 1980.
They were:

1. COSEU - The sum of memory allocated by batch
jobs (K bytes).

2. EXCP* - The I/O initiation rate by batch
jobs (I/Os per second).

3. SYSCPU - CPU utilization for system, i.e.
non-batch, tasks (a fraction between 0 and 1).

4. TOTCPU - Total CPU usage (a fraction between 0
and 1).

For 1981 the following interrupt measurements were
also included:

1. SVC - Supervisor calls (rate per second).

2. IO - I/O interrupts, completion of I/O operations
(rate per second).

3. PROG - Program interrupts (rate per second).


1 The Statistical Analysis System is a powerful
system for managing and analyzing data [SAS 79].
It was used for most of the data analysis.

* An acronym for "Execute Channel Program".

The probability distribution l(x) of a workload
variable is defined by

\[ l(x) = \Pr\{\text{workload} = x\} \]

and will be called the probability distribution of
load. When failures are collected and matched to
workload, the joint probability distribution of
failure and load results, and is defined by

\[ f(x) = \Pr\{\text{failure occurs and load} = x\}. \]

In this expression, failures and load values are
represented as they occur on an actual system,
where favored loads contribute more to the distribution
than loads of low probability. To remove
this effect we divide f(x) by the associated load
probability l(x). Using the well known notion of a
conditional probability distribution [Feller 68] we
write

\[ g(x) = \Pr\{\text{failure occurs} \mid \text{load} = x\} = \frac{f(x)}{l(x)}. \]

Therefore g(x) can be thought of as the probability
of a failure at a given load when all loads are
equally represented; it is the conditional failure
probability.
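
A minimal Python sketch of how l(x), f(x) and g(x) might be estimated from binned data is given below. The binning of the load axis, the function name, and the choice of normalizing both distributions by the total number of observation intervals are assumptions made for illustration; the paper does not spell out its estimation procedure at this point.

```python
def conditional_failure_probability(load_counts, failure_counts):
    """Estimate l(x), f(x) and g(x) = f(x) / l(x) per load bin.

    load_counts[i]:    number of observation intervals whose load fell in bin i.
    failure_counts[i]: number of failures matched to a load in bin i.
    """
    total = sum(load_counts)
    l = [n / total for n in load_counts]      # Pr{load = x}
    f = [k / total for k in failure_counts]   # Pr{failure and load = x}
    g = [fi / li if li > 0 else 0.0 for fi, li in zip(f, l)]
    return l, f, g


# Hypothetical counts: high loads are rare but fail more often per interval.
l, f, g = conditional_failure_probability([100, 50, 10], [2, 2, 1])
```

In this made-up example the joint probability f(x) is largest at low and medium load simply because those loads are common, while g(x) rises steadily with load.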


THE LOAD HAZARD MODEL

In this section we describe and validate a model,
hereafter referred to as a load-hazard model, which
will form the basis of our tests for a possible
load-failure dependency. It will be shown that if
the load is acting as a stress on the system, then
the load-hazard will increase with increasing load.

The object of our analysis was to determine
whether a load-failure relationship exists in our
data, i.e. whether a higher load stresses a system
more than a lower load. In practical terms, if
such an effect exists, we expect the load to act as
a stress factor. The proposed model is similar in
nature to the familiar hazard rate model from reliability
theory. Recall that the hazard rate, which
is the conditional probability that a system in
operation at time t will fail in the interval
(t, t+Δt), is defined in [Shooman 68] as:

\[ z(t) = \frac{\Pr\{\text{failure in } (t, t+\Delta t)\}}{\Pr\{\text{no failure in } (0, t)\}} \tag{1} \]

In close analogy with (1) above we propose a
load dependent hazard. This is illustrated by the
following elementary hypothetical experiment.
Imagine that the system is operating in the range
0 < x < L, where x is the actual system load and L
its upper limit. Assume that we have n identical
machines which are to be tested for a load-failure
dependency. The experiment consists of testing
each system for failures for increasing values of
x. We commence by defining n increasing values of
x, i.e. x_1 < x_2 < ... < x_n, at which we wish to
test our computers. The machines are first run in
the range (0, x_1). The load on each machine is
then increased from x_1 to x_2 and the number of
failures is counted. The systems are then loaded
from x_2 to x_3 and the failure frequencies established.
This process is continued until the maximum
load limit x_n is reached. If failures are load
dependent, we expect that the risk of a failure
will increase with increasing x in our experiment.
This will be reflected in the corresponding frequencies.
In more formal terms, we expect that the
probability that a system will fail at a load level
x + Δx, given that it is currently running at x,
will increase with increasing x.

The conditional probability described above
bears a close resemblance to the classical hazard
rate. Accordingly, we define a load hazard z(x) as

\[ z(x) = \frac{\Pr\{\text{failure in load interval } (x, x+\Delta x)\}}{\Pr\{\text{no failure in load interval } (0, x)\}} = \frac{g(x)}{1 - G(x)} \tag{2} \]

where: g(x) is the conditional failure probability,
       G(x) is its cumulative distribution function.

If z(x) increases with x, it should imply that the
load is acting as a stress or wearout factor. If,
however, z(x) remains constant for increasing x, we
may surmise an exponential relationship with load.
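
A discrete version of (2) can be sketched in Python as follows. This is an illustration only: the function name and the normalization of g(x) over the load bins (so that it can be accumulated into G(x)) are assumptions on our part, and G(x) is taken as the probability mass strictly below the current bin.

```python
def load_hazard(g):
    """Discrete load hazard z_i = g_i / (1 - G_i) over ordered load bins.

    g: conditional failure probabilities per load bin (any positive scale);
       they are normalized here so that they sum to one (an assumption).
    """
    total = sum(g)
    density = [v / total for v in g]
    hazards, cumulative = [], 0.0
    for d in density:
        survivors = 1.0 - cumulative          # 1 - G(x): no failure below this bin
        hazards.append(d / survivors if survivors > 1e-12 else float("nan"))
        cumulative += d
    return hazards


flat = load_hazard([1, 1, 1, 1])            # uniform g(x): hazard rises with load
geometric = load_hazard([8, 4, 2, 1, 1])    # geometric g(x): hazard stays flat
```

The two calls mirror the two cases named in the text: a hazard that grows with load, and one that stays constant (apart from the final bin, where every remaining failure must occur).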

Note that in our definition of load hazard we
have removed the variability of system load by
using g(x). Thus in the hypothetical experiment
all loads are equally represented. This of course
is not true in practice since load is best
described as a random variable with a probability
distribution; it is simply the associated load distribution,
l(x), defined above. In order to determine
the hazard for a particular load pattern, we
must superimpose the associated load probability on
the hazard calculated in (2). Denoting by z_a(x)
the transformed hazard, we have

\[ z_a(x) = z(x)\, l(x) \tag{3} \]


We refer to the hazard z(x), as defined in (2),
as the fundamental hazard. This is because it can
be thought of as an inherent property of a particular
system and is not subject to varying load patterns.
When a varying load pattern is taken into
account, it can be thought of as "picking out"
aspects of the fundamental hazard function. The
hazard z_a(x) defined in (3) will be referred to as
the apparent hazard, since it is closely dependent
on the load distribution.

Illustrative  Example 

The following example illustrates how a particular
workload can modify a given fundamental load hazard
z(x). Figure 6(a) shows a sample fundamental hazard
z(x). Note that z(x) is increasing with load.
Thus, if all load values are equally likely, the
system has a higher risk of failure at higher load
values than at lower load values. Fig. 6(b) is a
hypothetical load distribution where the load variable
is the fractional CPU utilization, with 0 for
an idle CPU and 1 for a fully busy CPU. Finally,
Fig. 6(c) gives the apparent hazard due to the
effect of the load distribution in (b). The apparent
hazard is now decreasing simply because higher
load values are less probable.
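
The same effect can be reproduced numerically with a few made-up values. The Python sketch below applies the product of the fundamental hazard and the load distribution bin by bin; the numbers are hypothetical and are not read from Figure 6.

```python
def apparent_hazard(z, l):
    """Apparent hazard per load bin: fundamental hazard times load probability."""
    return [zi * li for zi, li in zip(z, l)]


# Hypothetical values for four CPU-utilization bins, from idle to busy.
z = [0.1, 0.2, 0.3, 0.4]        # fundamental hazard, increasing with load
l = [0.6, 0.25, 0.1, 0.05]      # load distribution, high loads are rare
z_a = apparent_hazard(z, l)     # roughly 0.06, 0.05, 0.03, 0.02
```

Although the fundamental hazard quadruples from the first bin to the last, the apparent hazard falls, because the load probability drops faster than the hazard rises.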

Figure 6: Example of Fundamental and Apparent Hazards.
(a) Fundamental Hazard; (b) Load Distribution;
(c) Apparent Hazard.

Model Validation

Before using the proposed model on the SLAC load-failure
data, we tested it on an artificially created
data base. Our objective was to test if
indeed the hazard model would predict a known
dependency. Two tests were performed. In the
first the load hazard was expected to remain
unchanged with increasing load (i.e. that an exponential
load-failure relationship exists). Thus:

\[ \Pr\{\text{load induced failure}\} \propto e^{-\lambda x} \]

where: x = system load
       λ = constant load hazard parameter

A uniform load distribution was assumed. An artificial
data base consisting of 20,000 load samples
(5 minute averages) was created. The sample was
then seeded with failures, exponentially related to
the load level (i.e. load to first failure is exponential).
An unbounded arbitrary load parameter
(e.g. the I/O rate) was assumed. The failures were
generated using an inverse transformation method
similar to that described in [Fishman 73], for a
hazard value of λ = 0.001. In the second test, the
hazard was expected to increase with increasing
load (e.g. a uniform load failure relationship). A
bounded load parameter (CPU usage) was modeled. In
each case our hazard model was able to pick out the
known dependency. The resulting fundamental hazards,
as calculated by our formulation, are shown
in Figures 7 and 8.

Figure 7: Hazard Plot: Exponential Model

Figure 8: Hazard Plot: Uniform Model
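
The two validation tests can be imitated with a small simulation. The Python sketch below is an assumption-laden illustration, not the procedure used for the paper: it draws loads-to-failure by the inverse transformation x = -ln(1 - u)/λ, estimates the hazard per load bin as the fraction of surviving units that fail in the bin, and repeats the exercise for failures spread uniformly over a bounded load. The bin widths, the seed and all function names are hypothetical.

```python
import math
import random


def seed_failures(n, lam, rng):
    """Inverse transformation: exponential load-to-failure with parameter lam."""
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]


def empirical_hazard(failure_loads, edges):
    """Per bin: failures in the bin divided by units that survived to its start."""
    hazards = []
    for lo, hi in zip(edges, edges[1:]):
        at_risk = sum(1 for x in failure_loads if x >= lo)
        failed = sum(1 for x in failure_loads if lo <= x < hi)
        hazards.append(failed / at_risk if at_risk else float("nan"))
    return hazards


rng = random.Random(0)                       # fixed seed for repeatability

# Test 1: exponential relationship, lambda = 0.001, unbounded load.
loads = seed_failures(20000, 0.001, rng)
hazard = empirical_hazard(loads, list(range(0, 2001, 200)))
expected = 1.0 - math.exp(-0.001 * 200)      # constant hazard per 200-unit bin

# Test 2: uniform relationship on a bounded load (fraction of CPU in [0, 1)).
uniform_loads = [rng.random() for _ in range(20000)]
uniform_hazard = empirical_hazard(uniform_loads, [i / 10 for i in range(11)])
```

With the exponential seeding every bin should show roughly the same hazard (`expected`), whereas the uniform seeding gives a hazard that climbs toward one as the load approaches its bound.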


The generation of the hazard plots and associated
statistics involved extensive data processing. In
each hazard plot, z(x) or z_a(x) is calculated and
plotted as a function of a chosen workload variable,
x. In developing hazard plots for the load-failure
data, there is an important difference
between the real and the artificially created data.
This lies in the fact that, while an artificial data
base has specific dependencies seeded into it, in
the real world failures can occur due to a number
of causes. Examples are: temperature, humidity,
random noise, mechanical failures, and design
errors, some of which are unrelated to our study.
Those factors not related to load can be expected
to behave as noise in a load-failure analysis. If
these other factors are predominant, we can expect
to find no discernible pattern in our hazard plots,
i.e. they should appear as uncorrelated clouds
(e.g. see Fig. 9). This is well understood in any
statistical study of dependencies.


Figure 9: Uncorrelated Hazard Plot


An easily discernible pattern, on the other
hand, would indicate that the load-failure dependency
dominates others. The strength of such a
relationship can be measured through regression.
Figures 10, 11, and 12 depict the hazard plots for
the three selected load parameters. The regression
coefficient R², which is an effective measure of
the goodness of fit, is provided for each plot.
Quite simply, it measures the amount of variability
in the data that can be accounted for by the
regression model. R² values of greater than 0.6
(corresponding to an R > 0.75) are generally
interpreted as strong relationships [Younger 79].*


* The range of |r| from 0 to 1 is typically divided
as follows: (0, 0.25) moderately weak; (0.25, 0.5)
moderate; (0.5, 0.75) moderately strong;
(0.75, 1.0) strong.
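
For a straight-line fit, R² is the square of the correlation coefficient r between load and hazard. The Python sketch below computes it from first principles and applies the verbal scale of the footnote. It is a minimal illustration: the function names are ours, a simple linear regression is assumed, and the treatment of values falling exactly on a boundary is an arbitrary choice.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def strength(r):
    """Verbal label for |r| following the ranges quoted in the footnote."""
    r = abs(r)
    if r > 0.75:
        return "strong"
    if r > 0.5:
        return "moderately strong"
    if r > 0.25:
        return "moderate"
    return "moderately weak"


# Hypothetical hazard values at four load levels.
fit = r_squared([1, 2, 3, 4], [1, 3, 2, 4])
label = strength(fit ** 0.5)
```

In this made-up example R² is about 0.64, i.e. |r| is about 0.8, which the scale above would call a strong relationship.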


It can be seen that the hazards are
increasing with each of the load parameters shown.
The relationship is particularly strong with system
CPU or total CPU as load parameters. Thus it would
appear from our data that the load parameters are
acting as a stress factor, i.e. that there is an
increasing risk of failure with increasing load.

Note, however, that there is some degree of
overlap between the various load measures considered.
Ideally, one would like to define and estimate
a multivariate hazard function which correctly
reflects the relative contribution of each load
measure. In order to effectively achieve this goal
it is necessary to construct a multivariate utilization
function U(X_1, ..., X_n) that relates the
many varied measures of load to a single concept of
system activity. It is expected that the function U
would depend strongly on system configuration. The
development of such a model is currently under
investigation.


CONCLUSION

The analysis shows that there is a strong load
dependency of internal CPU errors at SLAC. The
observed tendency is present in three years of load
data analyzed. This is significant because our
previously reported results could only provide us
with an external view of permanent system and component
failures. By examining the CPU error generation
process we have been able to study the inner
behavior of the system and its reaction to errors.
Consequently, we have gathered the best data possible.
A load-failure relationship found at this
level must, in our view, be a fundamental
phenomenon. In addition, the fact that a large
majority of these errors are transients or
intermittents provides new information on these error
types, viz. their load dependent behavior.

Our analysis procedure has been demonstrated on
artificially created data bases seeded with failures.
The two hazard models proposed clearly differentiate
between fundamental (or inherent) and apparent
load dependent failures. An estimate of the fundamental
hazard z(x) provides the basic load-failure
relationship. The apparent hazard z_a(x) estimates
how z(x) is modified by the load probabilities. It
is, in principle, possible that even when no inherent
relationship exists between load and failures,
we could conceivably obtain an apparent dependency
simply due to the fact that some load values occur
more frequently than others. Alternatively, we can
have the reverse situation where an increasing fundamental
hazard is transformed into a non-increasing
or even decreasing apparent hazard by a distinctive
load distribution.

As with any statistical analysis, this is not
proof in itself. However, the increasing body of
evidence accumulated on different computers with
differing load and failure patterns shows that
workload should be considered as a factor relating
to reliability. Workload can be thought of as a
stress on the system, with greater stresses resulting
in greater risk of failure. In most cases the
effect of this stress is not permanent, since most
errors are transient. The design of computer systems
will be greatly aided if this type of analysis
can help uncover cause and effect relationships in
hardware errors.


ABSTRACT

This paper presents a procedure for modifying
embedded parity trees so that they are tested by
the inputs they receive during normal, fault-free,
operation of the circuit. This eliminates the need
for direct control over the input lines of the
parity tree for testing purposes. The faults that
are detected are single stuck-at faults at the
terminal lines of the XOR gates in the tree.
Applications of this procedure to some other
parity-related embedded code checkers are
presented.


INTRODUCTION 

A modular design for complex VLSI systems is
necessary for many reasons. One such reason is
"design for testability". It is simpler to deal
with smaller blocks when the question of test
pattern generation or error checking capability is
addressed. Unfortunately, a system that consists
of self-testing blocks is not necessarily self-testing.
For example, consider a network that
includes a combinational circuit B with inputs I_1
to I_p, outputs O_1 to O_n, and a parity tree C that
calculates the parity of the outputs of B, as in
Fig. 1. Parity tree C is tested for all single
stuck-at faults at the terminals of the XOR gates
by the test inputs shown in Table 1. However,
suppose that by applying normal inputs to B,
outputs of B receive only the patterns that are
listed in Fig. 1. In this case, the network of
Fig. 1 is not self-testing.
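
The difficulty can be made concrete with a small fault simulation. The Python sketch below is only an illustration under stated assumptions: the four-input, three-gate tree, the fault naming, and the restricted pattern set are hypothetical and are not the circuit of Fig. 1 or the inputs of Table 1.

```python
from itertools import product

# Hypothetical tree: g0 = x0 ^ x1, g1 = x2 ^ x3, g2 = g0 ^ g1 (tree output).
GATES = [("x0", "x1"), ("x2", "x3"), ("g0", "g1")]

# Single stuck-at faults on every XOR terminal: (gate, terminal, stuck value).
FAULTS = [(g, t, v) for g in range(len(GATES))
          for t in ("a", "b", "out") for v in (0, 1)]


def evaluate(pattern, fault=None):
    """Output of the tree for an input pattern, optionally with one fault."""
    values = {f"x{i}": bit for i, bit in enumerate(pattern)}
    for idx, (a, b) in enumerate(GATES):
        in_a, in_b = values[a], values[b]
        if fault and fault[0] == idx and fault[1] == "a":
            in_a = fault[2]
        if fault and fault[0] == idx and fault[1] == "b":
            in_b = fault[2]
        out = in_a ^ in_b
        if fault and fault[0] == idx and fault[1] == "out":
            out = fault[2]
        values[f"g{idx}"] = out
    return values[f"g{len(GATES) - 1}"]


def undetected(patterns):
    """Faults that never change the tree output for the given patterns."""
    return [f for f in FAULTS
            if all(evaluate(p, f) == evaluate(p) for p in patterns)]


all_patterns = list(product((0, 1), repeat=4))
restricted = [(0, 0, 0, 0), (1, 1, 1, 1)]   # what the tree might see in normal use
missed = undetected(restricted)
```

With full control over the inputs every terminal fault is detected, but with the restricted set several stuck-at-0 faults go unnoticed, since the affected lines never carry a 1 in normal operation.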

This simple example typifies the underlying
problem in building a self-testing network by
connecting self-testing blocks together. The
problem is that it may be necessary, for applying
test patterns, to have direct control over the
input lines of an embedded block, i.e., a block
some of whose input lines are not primary network
inputs. Such direct control generally requires
extra pins and/or circuitry on the chip and adds to
the complexity of the design.


The above problem was recognized and explicitly
considered by Anderson [Anderson 71]. To build a
self-testing network from self-testing blocks he
required that each block be fully exercised, i.e.,
that it receive all its input codewords with the
application of codewords to the inputs of the main
network. This, however, poses a strong restriction
on the design, and for some cases may be impossible
to achieve. Smith defined the concept of
sufficiently exercised blocks, which are self-testing
(embedded) blocks that receive their test
inputs during normal, fault-free, operation of the
network [Smith 76].

FIG. 1 An embedded parity tree.

TABLE 1 Test inputs for parity tree C of Fig. 1.

Based on Anderson's results, Wakerly concludes
that the general problem of designing a network of
fully exercised (and, for that matter, sufficiently
exercised) blocks is a difficult one to solve
[Wakerly 78]. Here, the discussion is limited to
only one class of blocks, namely, parity checkers.
A set of sufficient conditions is stated for the
existence of sufficiently exercised embedded parity
trees. If these conditions are satisfied, then a
slight modification of the parity checker makes it
such that the checker is tested by the input
patterns that it receives during normal, fault-free,
operation of the network. The faults that
are detected by the normal inputs are the single
stuck-at faults at the terminals of the XOR gates
in the parity tree. The modification has no
hardware cost or speed degradation associated with
it.


NOTATIONS  AND  DEFINITIONS 

Consider a network B* and a combinational block
B in B*, as shown in Fig. 2. B has p input lines,
I_1 to I_p, and n output lines, O_1 to O_n. The output
of B is encoded using techniques such as even
parity encoding. C is a checker that checks
whether the output lines of B form a codeword. C
must have at least two output lines; otherwise, its
only output line may be stuck at its "good" logic
value and this fault cannot be detected by applying
codeword inputs to C. Therefore, as shown in Fig.
2, assume that C has two output lines. Usually the
output lines of C form a 1-out-of-2 codeword. Thus
the input of C is assumed to be correct if and only
if the output lines of C carry complementary logic
values. For more detail, see [Carter 68],
[Anderson 71], and [Wakerly 78]. In this and the
next section, assume that the input code space of C
is the set of all n-bit words with even parity and
the output code space of C is the set of 1-out-of-2
words. Since the input lines of C come from the output
lines of B, C is naturally an embedded block.
Furthermore, the word patterns that the input lines
of C can receive during normal, fault-free
operation of the network depend on the logic
function of B, and in general are only a subset of
all the even-parity n-bit words. The main
objective of this work is to modify the parity
checker C such that the normal inputs of C detect
all single stuck-at faults at the terminals of the XOR
gates in C.

Consider a Boolean matrix M whose rows are all
of the (distinct) word patterns that the n output
lines of B receive during normal, fault-free
operation of the network. If there are m such
patterns, then matrix M is an m by n Boolean
matrix. Note that all the rows of M have even
parity. Call M the (normal) output matrix of B.
The columns of M denote the logic values on
individual output lines of B during normal,
fault-free network operation. Thus, there is a
one-to-one correspondence between the columns of M and the
output lines of B. The column in M that
corresponds to output line O_i of B is called the
(normal) column corresponding to line O_i.
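
As a small illustration of this definition, the following sketch builds the normal output matrix for a hypothetical block B. The function `block_b`, its two-bit input space, and the helper names are assumptions made for the example; they are not taken from the paper.

```python
from itertools import product

def block_b(a, b):
    # Hypothetical block B: two data bits followed by an even-parity bit.
    return (a, b, a ^ b)

def normal_output_matrix(block, normal_inputs):
    """Distinct output words seen during normal operation, in first-seen order."""
    rows = []
    for word in normal_inputs:
        out = block(*word)
        if out not in rows:
            rows.append(out)
    return rows

def columns(matrix):
    """Normal columns: one tuple per output line of B."""
    return list(zip(*matrix))

M = normal_output_matrix(block_b, product((0, 1), repeat=2))
assert all(sum(row) % 2 == 0 for row in M)  # every row has even parity
print(M)           # rows of the normal output matrix
print(columns(M))  # normal columns of O_1, O_2, O_3
```

Each tuple returned by `columns` plays the role of the normal column of one output line.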


FIG. 2 The circuit under consideration.

The traditional design of the single even parity
checker C is obtained by partitioning the O_i lines
into two arbitrary groups of preferably equal or
almost equal sizes. Then, for each group, there is
a parity tree that calculates the parity of its
corresponding lines. The output of one of the
trees is then inverted and, under normal
conditions, forms a 1-out-of-2 code with the output
of the other tree. For more detail, see, for
example, [Wakerly 78]. Figure 3 shows an example
of such a design for 11 input lines. If the O_i lines
(i.e., output lines of B) are connected to the
input lines of C as shown in Fig. 3, then the
normal column corresponding to the ith input of C
(from the left) is the same as the normal column
corresponding to O_i. However, note that any O_i can
be connected to any input line of C, as long as
every O_i line is connected to exactly one input of
C. This freedom is the basic tool utilized in
Algorithm A of the next section.

FIG. 3 An example.

The definition of
normal columns can now be extended to apply to the
internal lines of C. The normal column
corresponding to any line x in C (Fig. 3) is an
m-element Boolean column vector whose ith element is
the logic value of line x when the output of B is
the ith row of the normal output matrix of B, for
i = 1, 2, ..., m. That is, the components of the normal
column corresponding to a line x are the logic
values on line x for all normal outputs of B,
applied in the order they appear in the normal
output matrix of B. A line x in C is an active
line if it receives both 0 and 1 during normal,
fault-free operation of the circuit. Thus, the
normal column corresponding to an active line is a
nonconstant column. A line with a constant normal
column is a passive line.


THE DESIGN OF SELF-TESTING EMBEDDED PARITY CHECKERS

For a given set F of possible faults in it, a
parity checker C, as shown in Fig. 2, is said to be
self-testing if for any fault f in F, there is a
normal output pattern of B that causes either a
<1,1> or a <0,0> output for C. Assume a design
such as in Fig. 3 for C. Let the set F of faults
consist of single stuck-at faults at the inputs and
outputs of the XOR gates. Since there is always a
sensitized path from any XOR gate terminal to the
output of the parity tree, the embedded parity
checker C is self-testing if and only if every XOR
gate terminal is an active line [Bossen 70]. Given
a parity checker C and the normal output matrix for
the block B, as exemplified in Figs. 2 and 3,
Algorithm A below inspects every XOR gate terminal
in C to see whether it is a passive line. If it
is, then the Algorithm finds a new connection
between the output lines of B and the input lines
of C that makes that line active. After the
termination of the Algorithm the connection
prescribed by it makes every line in C active, and
hence results in a self-testing embedded parity
checker. In order for the Algorithm to work, the
following conditions must be satisfied:

A1. Circuit C is implemented with two-input
XOR gates.

A2. In M, the normal output matrix of B, no
column is constant and no two columns
are identical or complementary.
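
Condition A2 can be checked mechanically. The sketch below is one possible way to do it, assuming the matrix is given as a list of equal-length rows of 0/1 values; the function names and the two sample matrices are hypothetical. Two columns are identical when their bitwise XOR is all 0 and complementary when it is all 1, so both cases reduce to a constant XOR.

```python
def is_constant(col):
    return len(set(col)) == 1

def satisfies_a2(matrix):
    """True if no column is constant and no two columns are identical or complementary."""
    cols = list(zip(*matrix))
    if any(is_constant(c) for c in cols):
        return False
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            xor = [x ^ y for x, y in zip(cols[i], cols[j])]
            # All-0 XOR: identical columns. All-1 XOR: complementary columns.
            if is_constant(xor):
                return False
    return True

good = [(1, 0, 1), (0, 1, 1), (0, 0, 0)]  # hypothetical matrix meeting A2
bad = [(1, 0, 1, 0), (0, 1, 0, 1)]        # columns 1 and 2 are complementary
print(satisfies_a2(good), satisfies_a2(bad))
```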


Two columns are complementary if they are
complementary in all components. Note that
Assumption A2 amounts to removing the redundant
lines from the output of block B. If the above
conditions are satisfied, Algorithm A below makes
the embedded parity checker C self-testing. Note
that in C (Fig. 3), any terminal of any XOR gate is
the parity of a set of input lines of C. For line
x, this set is denoted by S(x). Let M(x) be the
binary matrix whose columns are the columns
corresponding to the fault-free input lines in
S(x). To check whether a line x is a passive line,
that is, to check if the column corresponding to x
is a constant column, one has to check whether the
column obtained by XORing together the columns in
M(x) is constant. In other words, x is passive if
and only if the bit-by-bit XORing of the normal
columns corresponding to input lines in S(x)
results in a constant vector. Otherwise, x is
active. In Algorithm A, E1 and E2 denote the two
primary outputs of the parity checker.

Algorithm A:

1. If E1 is passive, exchange any arbitrary O_i
in S(E1) with any arbitrary O_j in S(E2).

(This exchange makes E1 and E2 active.)

2. Mark E1 and E2.

3. Consider input lines a and b of any XOR
gate with marked output line and unmarked
input lines. If, say, a is passive,
exchange any O_i in S(a) with any O_j in
S(b). If this exchange makes b passive,
exchange O_i, which is now in S(b), with O_k,
a member of S(a) different from O_j.

(If before Step 3 both a and b are not
active, then this Step makes them active in
at most two exchanges.)

4. Mark a and b.

5. If there are no more unmarked lines, EXIT;
otherwise, go to 3.

In the Appendix it is proved that if conditions A1
and A2 are satisfied, then Algorithm A makes the
parity checker C self-testing.

EXAMPLE: Once again, consider the example of
Fig. 3, with the specified normal output matrix for
B. Since E1 corresponds to the parity of the first
8 inputs of C, with the connection shown in Fig. 3,
E1 will have the following normal column:

0
0
0
0
1

So, mark both E1 and E2. Since E1 is marked,
consider lines x and y. Line x is the parity of
the first four input lines of C. Therefore, the
normal column corresponding to x is an all-0
column; i.e., x is a passive line. To make x
active, exchange the connections of O_4, which is in
S(x), and O_5, which is in S(y). This results in
matrix M' of Table 2, which is obtained by
exchanging columns 4 and 5 of matrix M. This
exchange makes x active; however, line y becomes
passive, as its corresponding normal column is now
an all-1 column. For this case, Algorithm A
cancels the latest exchange, and instead exchanges
O_5 with another member of S(x), say, O_3. This
exchange results in matrix M" of Table 2, which is
obtained from M by exchanging columns 3 and 5. It
makes both x and y active. The continuation of the
Algorithm results in no more exchanges. The matrix
M" is translated into the connection shown in Fig.
4. All the lines in parity checker C of Fig. 4 are
active. Therefore, the embedded checker C is
self-testing.

       11010100110         11100100110
       00101101011         00011101011
M'  =  10001110101   M" =  10010110101
       01001011110         01010011110
       01100010111         01001010111


TABLE 2 Modified output matrices for
the example of Fig. 3.


FIG.  4  Self-testing  connection  for 
the  parity  checker  of  Fig.  3. 
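
The steps of Algorithm A can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' implementation. Several things are assumptions: the matrix M is recovered by exchanging columns 4 and 5 of M' in Table 2 back again, the balanced tree shapes below x and y and for E2 are guessed (Fig. 3 is not reproduced here), and the choice of "any" O_i, O_j, O_k is fixed to the last and first inputs of a group so that the run mirrors the example. Following the example's wording, the second exchange is performed by cancelling the first one and exchanging O_j with O_k, which leaves the same sets S(a) and S(b) as Step 3.

```python
def leaves(node):
    """Checker input positions in S(node); a leaf is an int, a gate is a pair."""
    return [node] if isinstance(node, int) else leaves(node[0]) + leaves(node[1])

def passive(M, conn, node):
    """A line is passive when the XOR of the normal columns in S(node) is constant."""
    col = {sum(row[conn[k]] for k in leaves(node)) % 2 for row in M}
    return len(col) == 1

def swap(conn, p, q):
    conn[p], conn[q] = conn[q], conn[p]

def fix_pair(M, conn, a, b):
    """Step 3 for the two input lines a and b of one XOR gate."""
    if not passive(M, conn, a):
        if not passive(M, conn, b):
            return
        a, b = b, a                       # let a denote the passive line
    pa, pb = leaves(a)[-1], leaves(b)[0]
    swap(conn, pa, pb)                    # exchange O_i in S(a) with O_j in S(b)
    if passive(M, conn, b):
        swap(conn, pa, pb)                # cancel the exchange ...
        pk = next(p for p in reversed(leaves(a)) if p != pa)
        swap(conn, pk, pb)                # ... and exchange O_j with O_k instead

def algorithm_a(M, tree1, tree2):
    conn = list(range(len(M[0])))         # conn[k]: output of B wired to input k of C
    if passive(M, conn, tree1):           # Step 1
        swap(conn, leaves(tree1)[-1], leaves(tree2)[0])
    marked = [tree1, tree2]               # Step 2
    while marked:                         # Steps 3 to 5
        node = marked.pop()
        if not isinstance(node, int):
            fix_pair(M, conn, *node)
            marked.extend(node)           # Step 4
    return conn

rows = ["11001100110", "00110101011", "10010110101", "01010011110", "01100010111"]
M = [[int(c) for c in r] for r in rows]
T1 = (((0, 1), (2, 3)), ((4, 5), (6, 7)))  # assumed tree for E1; x and y are its children
T2 = ((8, 9), 10)                          # assumed tree for E2
conn = algorithm_a(M, T1, T2)
print([k + 1 for k in conn])               # output line of B feeding each input of C
```

Under these assumptions the run ends with O_3 and O_5 exchanged, which is the connection described by matrix M" of Table 2.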


Algorithm  A  can  be  used  to  design  self-testing 
embedded  checkers  for  other  parity-related  encoding 
schemes. 


The self-testing two-rail checker tree with n
input pairs, as described in [Carter 68] and
[Anderson 71], has a one-to-one correspondence with
an n-input parity tree, where each input of the
parity tree is replaced with an input pair from the
two-rail code, and each XOR gate is replaced with a
two-rail checker with two input pairs and a
1-out-of-2 output code. Fig. 5 shows a self-testing
two-rail checker tree with 8 input pairs. The
corresponding parity tree for this is shown in Fig.
6. If the O_i lines of Fig. 5 satisfy Assumption
A2, then Algorithm A can be applied to the circuit
of Fig. 6, and any changes made to this circuit can
readily be translated back into the original
two-rail checker of Fig. 5. If line T of Fig. 5 (and
hence of Fig. 6) is passive, the two-rail input
pairs should be partitioned into two arbitrary
groups, as in self-testing parity checker design,
and the pairs in each group should have a separate
two-rail checker. A trivial such partitioning for
the circuit of Fig. 5 is shown in Fig. 7. As far as
speed is concerned, this is not a good partition;
however, other partitions are possible that result
in faster checkers. Now each tree in Fig. 7 can be
translated into a parity tree, as described above.
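
The correspondence between a two-rail checker cell and an XOR gate can be seen in a short sketch. It assumes the common textbook AND-OR form of the cell; the cell drawn in Fig. 5 may be implemented differently, and the function name is hypothetical.

```python
from itertools import product

def two_rail_cell(x, y):
    """Textbook two-rail checker cell; x and y are (rail0, rail1) pairs."""
    (x0, x1), (y0, y1) = x, y
    z0 = (x0 & y0) | (x1 & y1)
    z1 = (x0 & y1) | (x1 & y0)
    return (z0, z1)

# For two-rail (complementary) input pairs, one output rail is the XOR of
# the corresponding input rails and the output pair is again two-rail.
for x0, y0 in product((0, 1), repeat=2):
    z0, z1 = two_rail_cell((x0, 1 - x0), (y0, 1 - y0))
    assert z1 == x0 ^ y0
    assert z0 == 1 - z1

# A non-two-rail input pair produces a non-codeword output.
print(two_rail_cell((1, 1), (1, 0)))
```

Because one rail of every cell output follows the XOR of the input rails, activity of a line in the parity tree of Fig. 6 corresponds to activity of the matching pair in the two-rail tree.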




FIG.  5  A  self-testing  two-rail  checker 
tree. 


FIG.  6  Parity-tree  equivalent  of 
Fig.  5. 


FIG.  7  A  partitioned  two-rail  checker. 
Self-Testing Embedded Detector for SEC-DED Circuits

Single error correcting and double error
detecting (SEC-DED) codes are a very popular means
of checking/correcting faults in memory arrays. The
general scheme for SEC-DED decoders is shown in
Fig. 8. Such circuits are naturally embedded. In
particular, if they are used with ROMs, one may not
be able to apply the required test patterns to
these circuits since the contents of ROMs are
predetermined. Thus it is necessary to make
modifications to make such embedded SEC-DED
circuits self-testing. Here we use Algorithm A to
make the detector portion of the circuit
self-testing.

As a specific example, consider 16-bit input
data. This requires 6 check bits. Thus, in Fig.
8, n = 16 and m = 6. Hsiao has provided an optimal
circuit for this case [Hsiao 70]. The same design
has been used in some commercial products, e.g.,
TI's SN54/74LS630. Let the data lines be denoted
by d_0 to d_15, and let the check bits be c_1 to c_6.
Figures 9 and 10 show the design of the syndrome
generator and the error detector, respectively, as
given in [Hsiao 70]. The control lines have been
left out for simplicity. The following describes a
procedure for making the detector portion of
the circuit of Fig. 8 (i.e., circuits of Figs. 9
and 10) self-testing. The procedure works if all
the 22 input lines of the SEC-DED circuit satisfy
Assumption A2.

First, each XOR tree block of the syndrome
generator must be modified as shown in Fig. 11.
For the particular example at hand, there are six
such parity checkers, each with a 1-out-of-2 output
code. Thus the output of the syndrome generator is
a two-rail code with six pairs, <E1,E1'> to
<E6,E6'>. If for all normal inputs to the circuit
E1 ⊕ ... ⊕ E6 is constant, take any one of the six
parity checkers of the syndrome generator and
exchange line 0^ with any of the other eight lines
(Fig. 11). After this, the above parity is no
longer constant [Khakbaz 82a]. For any of the six
parity checkers just obtained, use Algorithm A to
make it self-testing. Since all the E_j' output
lines of the syndrome generator are (inverted)
primary inputs to the SEC-DED circuit, they satisfy
Assumption A2. Hence an (embedded) self-testing
two-rail checker can be designed, as described
above. This replaces the OR gate of Fig. 10.


FIG. 8 SEC-DED decoder circuit.


FIG. 9 The syndrome generator.


FIG. 11 Modified parity tree for
the syndrome generator.


Finally, to distinguish between single and double
errors, the E_j lines are input into a parity tree.
Similarly, a second parity tree calculates the
parity of the E_j' lines. Since the parity of the
E_j lines (similarly E_j' lines) was made to be
nonconstant, and since the E_j lines (similarly E_j'
lines) satisfy Assumption A2, they can be made
self-testing using Algorithm A. Figure 12 shows
the self-testing embedded error detector portion of
the SEC-DED circuit of Fig. 8. If all the inputs
to the SEC-DED circuit are correct, then the
syndrome generator produces a two-rail code and
<R,R'> results in a 1-out-of-2 code. If one or
more input lines are erroneous, then the output of
the syndrome generator will not be two-rail, and
hence <R,R'> does not form a 1-out-of-2 code. By
the special encoding of [Hsiao 70] that is used
here, and by the special design of the syndrome
generator, one erroneous input results in an odd
number of erroneous lines at the output of the
syndrome generator, i.e., of the 12 output lines of
the syndrome generator, an odd number would be
erroneous. Similarly, if two input lines are
erroneous, an even number of the output lines of
the syndrome generator will be erroneous. It is
not hard to see that the <P,Q> pair of Fig. 12
forms a 1-out-of-2 code if and only if there are an
even number of errors on the 12 input lines to the
two parity trees. This argument leads to Table 3,
which indicates how to interpret the outputs of the
circuit of Fig. 12.


FIG. 12 Modified detector portion for
self-testing embedded SEC-DED
decoder circuit.


TABLE 3 Reading of outputs of the
circuit of Fig. 12.
CONCLUSION

A procedure has been developed for designing
embedded parity checkers that are self-testing for
all single stuck-at faults at the terminals of the
XOR gates. This procedure has no hardware cost or
speed degradation associated with it. However,
applications of it to other parity-related code
checkers may have a slight speed penalty (e.g.,
Fig. 11).

There is much room for expanding the ideas and
methods presented in this paper. In particular, (1)
work needs to be done on finding other codes for
which self-testing embedded checkers can be
designed, and (2) other algorithms should be
developed for detecting a more extensive set of
faults in the checker. One such algorithm has been
developed recently that results in an embedded
parity tree that is self-testing for all faults
within any single XOR gate [Khakbaz 82b].
APPENDIX

Consider a set V of m-bit binary vectors. If v is
in V, the set obtained by removing v from V is
denoted by V-v. Also, the set obtained by adding a
new vector w to V is denoted by V+w. Let p(V) be a
vector obtained by bit-by-bit XORing of the vectors
in V. That is, the ith element in p(V) is the
parity of the ith row of a matrix whose columns are
the members of V. Call p(V) the parity vector
corresponding to V. Similarly, define p(V-v),
p(V+w), and so on.

LEMMA 1 Let u, v, and w be m-bit binary column
vectors. If both u ⊕ v and u ⊕ w are constant
vectors, then either v and w are identical or they
are complements of each other.

LEMMA 2 Let V be a set of m-bit binary column
vectors. Let v be in V. Then, p(V) = p(V-v) ⊕ v.

LEMMA 3 Let V be a set of m-bit binary column
vectors. Let w be an m-bit binary column vector
not in V. Then, p(V+w) = p(V) ⊕ w.

The proofs of the above Lemmas are simple and
directly follow the definitions.
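
The three Lemmas can also be checked numerically. The sketch below is only a sanity check, not a proof: Lemma 1 is verified exhaustively for m = 3, and Lemmas 2 and 3 on one sample set. The helper names `xor`, `p`, and `constant` and the sample vectors are chosen for the example.

```python
from itertools import product
from functools import reduce

def xor(u, v):
    return tuple(a ^ b for a, b in zip(u, v))

def p(V):
    """Parity vector of a non-empty collection of equal-length vectors."""
    return reduce(xor, V)

def constant(v):
    return len(set(v)) == 1

m = 3
vectors = list(product((0, 1), repeat=m))
ones = (1,) * m

# Lemma 1: u^v and u^w constant implies v == w or v == complement of w.
for u, v, w in product(vectors, repeat=3):
    if constant(xor(u, v)) and constant(xor(u, w)):
        assert v == w or v == xor(w, ones)

# Lemmas 2 and 3 on a sample set V.
V = [(1, 0, 1), (0, 1, 1), (1, 1, 1)]
v = V[0]
assert p(V) == xor(p(V[1:]), v)        # Lemma 2: p(V) = p(V-v) XOR v
w = (0, 0, 1)
assert p(V + [w]) == xor(p(V), w)      # Lemma 3: p(V+w) = p(V) XOR w
```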

LEMMA 4 If Assumptions A1 and A2 hold for the
parity checker C, as exemplified in Fig. 3, then
Algorithm A makes all the lines in C active.

PROOF First, show that after Step 1, both E1 and
E2 are active. If they are active from the
beginning, Step 1 does nothing, and the assertion
is trivially true. So, assume E1 is passive at the
beginning. Since the normal column corresponding
to E1 is complementary to that of E2, E2 would also
be passive. Thus p(S(E1)) and p(S(E2)) are
constant vectors. The exchange in Step 1 results
in two new sets of input lines corresponding to E1
and E2, as follows:

S'(E1) = S(E1) - O_i + O_j;
and S'(E2) = S(E2) - O_j + O_i.

But originally

S(E1) = S(E1) - O_i + O_i;
and S(E2) = S(E2) - O_j + O_j.

If p(S'(E1)) is also constant, then by Lemmas 1, 2,
and 3 it is concluded that O_i and O_j are identical
or complementary, contradicting Assumption A2.
Similarly, it can be shown that p(S'(E2)) is not
constant. Thus E1 and E2 are active at the end of
Step 1.

Now consider Step 3. Let a and b be the two inputs
to a gate whose output has been marked, but whose
inputs have not been marked. If both a and b are
active, they are marked. If, say, a is passive,
exchange O_i in S(a) with O_j in S(b) to get:

S'(a) = S(a) - O_i + O_j;    (1)

and S'(b) = S(b) - O_j + O_i.    (2)

By a similar argument as above, it can be shown
that this exchange makes a active. However, now b
may be passive. In this case the specified
exchange results in

S"(a) = S'(a) - O_k + O_i;    (3)

and S"(b) = S'(b) - O_i + O_k.    (4)

Existence of such O_k different from O_j in S'(a) is
guaranteed, since otherwise S'(a) and hence S(a)
would have only one member, which by Assumption A2
would imply that in fact a could not have been
passive to start with. Substituting (1) in (3) and
(2) in (4):

S"(a) = S(a) - O_k + O_j;    (5)

and S"(b) = S(b) - O_j + O_k.    (6)

Since S(a) = S(a) - O_k + O_k, and p(S(a)) is assumed to be
constant (since a was originally passive), then
p(S"(a)) cannot be constant; otherwise (5) and the
above Lemmas would yield that O_k is identical or
complementary to O_j. Also, since it was assumed
that p(S'(b)) was constant (i.e., since it was
assumed that the first exchange between O_i and O_j
made b passive), then p(S"(b)) would not be
constant; otherwise, (2) and (6) would imply that
O_i is either identical or complementary to O_k.
Therefore in at most two exchanges in Step 3 lines
a and b become active and are subsequently marked.
Since each time that Step 3 is executed two lines
are marked, the Algorithm stops in a time
proportional to the number of the lines in the
tree. Finally, if t is the output line of the gate
with input lines a and b, then, by the structure of
the tree, S(t) = S(a) ∪ S(b). Thus, any
exchanges between S(a) and S(b) do not affect the
fact that t (or, for that matter, any ancestor of
t) is an active line. In other words, during the
process of the Algorithm, all the marked lines
remain active. Q.E.D.

THEOREM If Assumptions A1 and A2 hold for the
parity checker C, then Algorithm A makes C
self-testing for all single stuck-at faults at the
terminal nodes of the XOR gates.

PROOF By Lemma 4, Algorithm A makes all the
lines in C active. Suppose line x in C is stuck at
u (u is 0 or 1). Since x is active, there is a
normal input pattern to C that, under fault-free
condition, puts logic value u' on x. Assume x is
in S(E1). Then, with x stuck at u, the above input
pattern causes an erroneous logic value on E1. Thus,
<E1,E2> does not form a 1-out-of-2
codeword. Q.E.D.
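
The Theorem can be illustrated by simulating every single stuck-at fault on the example of Fig. 3. The sketch below is an illustration under assumptions: the balanced tree shapes are guessed, E2 is taken to be the inverted tree output, the original matrix M is recovered from Table 2 by undoing the column exchange, and all function names are hypothetical. A fault is reported as undetected when <E1,E2> stays a 1-out-of-2 codeword for every normal input word.

```python
def value(node, word, fault=None):
    """Logic value of a line; fault is (line, stuck_value) or None."""
    if fault is not None and fault[0] == node:
        return fault[1]
    if isinstance(node, int):
        return word[node]
    return value(node[0], word, fault) ^ value(node[1], word, fault)

def lines(node):
    """Every line of a tree: its root, internal gate outputs, and inputs."""
    return [node] if isinstance(node, int) else [node] + lines(node[0]) + lines(node[1])

def undetected(words, tree1, tree2):
    """Single stuck-at faults that never produce <0,0> or <1,1> on <E1,E2>."""
    missed = []
    for line in lines(tree1) + lines(tree2):
        for stuck in (0, 1):
            f = (line, stuck)
            # E2 is the inverted parity of its group, so a codeword has E1 != E2.
            if all(value(tree1, w, f) != 1 - value(tree2, w, f) for w in words):
                missed.append(f)
    return missed

def bits(rows):
    return [[int(c) for c in r] for r in rows]

T1 = (((0, 1), (2, 3)), ((4, 5), (6, 7)))  # assumed tree for E1
T2 = ((8, 9), 10)                          # assumed tree for E2
M_orig = bits(["11001100110", "00110101011", "10010110101",
               "01010011110", "01100010111"])
M_mod = bits(["11100100110", "00011101011", "10010110101",
              "01010011110", "01001010111"])  # matrix M" of Table 2

print(undetected(M_orig, T1, T2))  # line x stuck at 0 escapes detection
print(undetected(M_mod, T1, T2))   # no fault escapes after the re-connection
```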
ABSTRACT

Applications of watchdog processors for
detection of system malfunctions are described.
Low-cost watchdog processors can be designed so
that they have knowledge about the design
specifications of a system and therefore can detect
a large class of malfunctions by monitoring the
run-time behavior of that system.

The concept of capability checking is
introduced. Capability checking is aimed at the
detection of malfunctions that cause illegal access
to the memory system. It is shown that only a
subset of such malfunctions is detected by the
operating system. In the capability checking
technique all access-right information is given in
advance to an auxiliary low-cost processor, called
a capability processor. The capability processor
checks the validity of each access to the memory
system dynamically. The implementation details of
a capability processor are explained.


INTRODUCTION 

One of the most basic techniques for checking
the behavior of a system is the use of a watchdog
timer [Con 72], [Orn 75]. The system is designed
such that under normal operation it signals the
watchdog timer within a specified time interval.
This signal presets the timer to its initial value.
The timer generates an error if no preset signal is
received during that specified time interval. It
is obvious that many malfunctions can occur while
the system still generates a correct timing signal.
In this paper we study the design of low-cost, yet
more sophisticated watchdog processors for
concurrent testing of a system. Watchdog processors
[Lu 80] can be designed to have more knowledge
about the design specifications of a system and
hence be able to detect abnormal behaviors of that
system at runtime.
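
The behavior of a basic watchdog timer can be sketched with a tick-driven simulation. The class, the `run` helper, and the tick granularity below are assumptions made for illustration; a real timer is a hardware counter.

```python
class WatchdogTimer:
    """Counts down once per tick; the monitored system must preset it in time."""

    def __init__(self, interval):
        self.interval = interval
        self.remaining = interval
        self.error = False

    def preset(self):
        # Signal from the system under normal operation: reload the initial value.
        self.remaining = self.interval

    def tick(self):
        # One time unit elapses; flag an error if the interval expires.
        self.remaining -= 1
        if self.remaining <= 0:
            self.error = True

def run(interval, preset_ticks, total_ticks):
    """Simulate total_ticks time units; the system presets at the given ticks."""
    wd = WatchdogTimer(interval)
    for t in range(total_ticks):
        if t in preset_ticks:
            wd.preset()
        wd.tick()
    return wd.error

print(run(3, {0, 2, 4, 6}, 8))  # regular presets: no error is raised
print(run(3, {0}, 8))           # the system stops signalling: error is raised
```

The sketch also shows the limitation noted above: any malfunction that leaves the preset signal intact goes unnoticed.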

Several approaches [Bol 78], [Yao 80] have been
proposed for detection of malfunctions which result
in control flow errors. Equally important is to
study the effect of these malfunctions on the way
the memory is referenced. This would be an attempt
to detect system malfunctions as well as to prevent
memory mutilation.


The idea of capability checking presented here
is to use an auxiliary low-cost processor, called a
Capability Processor (CP), to verify the validity
of memory references. A typical configuration for
the system is shown in Fig. 1. The capability
processor operates in parallel with the CPU and
detects a large class of illegal accesses to the
memory system; a subset of these illegal accesses
also is detected by the operating system.


Figure  1 


SYSTEM LEVEL MALFUNCTIONS

Classical  methods  of  testing  concentrate  on 
functional  testing  at  the  circuit  level. 
Unfortunately  there  exists  a  gap  between  the  effect 
of  faults  at  the  circuit  level  and  their  behavior 
at  the  system  level.  As  a  very  simple  example 
consider  a  memory  system  that  uses  extra  check  bits 
for  error  detection  and  correction.  Some  multiple 
bit  errors  may  go  undetected  at  the  circuit  level. 
At  the  system  level  this  may  correspond  to  changing 
a  correct  instruction  to  an  incorrect  one  causing 
the  program  to  perform  a  different  operation.  We 
can  detect  such  errors  if: 

1)  The  design  specifications  (from  which  the 
behavior  of  the  system  can  be  predicted)  are  known. 

2)  The  errors  cause  abnormal  behavior  of  the 
system. 


Much research has been done in the area of
operating systems which support protection [Sal
75]. In a typical descriptor-based system such as
the IBM S/370 or the PDP-11/45 [Sal 75] the
operating system loads the descriptor register with
the base, limit, and the access right information.
On the other hand, in a capability-based system [Fab
74] such as the PLESSEY S/250 [Eng 74] or the
Cambridge CAP computer [Wil 79] the users
themselves can load the descriptor register but
only from a limited set of descriptor values (or
capabilities) that has been given to them by the
operating system.

Information used for the purpose of protection
is stored in the memory, and in general, all
protection systems assume fault-free hardware. This
assumption, however, can be invalidated and
protection violations can go undetected. This
problem becomes more serious in virtual memory
systems or capability-based systems where many page
tables or capability-lists are stored in the main
memory. There are three categories of errors that
may not be detected by the operating system:

1) Errors in a memory word, protection
registers, address bus, etc. caused by hardware
failure.

2) A software error (accidental or malicious)
in a user program.

3) A software error in a system routine which
is assumed to be highly trustable.

As an example of a hardware failure that can
result in protection violations, consider the
paging system in the VAX-11 [Lev 80]. A program
references the memory by giving the Virtual Page
Number (VPN) and an offset in that page. The VPN
points to an entry of the Process Page Table (PPT).
A Process Base Register (PBR) points to the PPT.
The physical address is formed by concatenation of
the Page Frame Number (PFN), derived from the Page
Table Entry (PTE), and the offset in the
instruction, as shown in Fig. 2. The following
hardware failures can cause a wrong memory access
not detected by the protection system:
a. An error in the PBR or PTE (PBR and PTE can
be changed only by the OS).

b. An error in the VPN (part of the address
in an instruction).

c. Failure of the access check mechanism (CPU
failure) and invalid access attempt.

d. A fault on the address bus.
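
The effect of failure (a) can be illustrated with a simplified translation routine. This is a sketch under assumptions: a page table entry is reduced to its PFN, the page table is a plain dictionary, the 9-bit offset (512-byte pages) and the sample values are chosen for the example, and no protection bits are modeled.

```python
OFFSET_BITS = 9  # assumed 512-byte pages

def translate(page_table, vpn, offset):
    """Physical address = PFN (from the PTE selected by the VPN) followed by the offset."""
    pfn = page_table[vpn]
    return (pfn << OFFSET_BITS) | offset

page_table = {0: 0x12, 1: 0x34}   # hypothetical VPN -> PFN entries
good = translate(page_table, 1, 0x0A5)

corrupted = dict(page_table)
corrupted[1] ^= 0x04              # single-bit error in the PTE (case a)
bad = translate(corrupted, 1, 0x0A5)

print(hex(good), hex(bad))        # the access silently lands in another page frame
```

Nothing in the translation itself signals the error; the reference is simply redirected to a different page frame.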

On the other hand, most software errors are due
to design and coding errors and in general it is
very difficult to guarantee that once the software
has passed its test, it is free of any errors [Yao 80].


Before proceeding to the subject of capability
checking it is helpful to define the terms which
are used in this paper.

Definition: A system level malfunction is a
deviation in the behavior of a system from its
design specifications as a result of a hardware
failure, a software error, or a design error.

In the presence of a system level malfunction
the operation performed by the system is either
illegal or incorrect.

Definition: An operation is illegal if, based
on the design specifications, that operation is
never allowed. For example, execution from a "data"
segment is an illegal operation.

Definition: An operation is incorrect if, based
on the design specifications and the current
conditions, that operation is not correct. However,
the same operation can be correct under certain
conditions.

For example, if a program can write into two
different data segments S1 and S2 depending on the
value of a predicate "P", an incorrect operation
would be to write into S2 instead of S1 as a result
of an error in "P".

In general, detection of incorrect operations
is more difficult than detection of illegal
operations. Most incorrect operations occur as a
result of incorrect decisions at branch points.
These decisions in general can depend on the input
data. Redundant predicates can be used to minimize
the probability of a wrong decision [Kan 75]. In
this paper we concentrate mostly on the detection
of illegal operations, although some incorrect
operations can also be detected.

Definition: An object is a set of logically
contiguous memory cells whose type determines the
class of operations that can be performed on it. A
process can own a set of objects, to which it has
full access. In addition, a process can be
given access to some (external) objects by the
owner of those objects. Examples of objects are a
program, a data segment, or a page.


Figure 2

Definition: A capability to an object is a
special name that allows a specific access to that
object. It has a unique logical address field, a
type, and an access right field.

At any given time, the set O = (O_1, O_2, ..., O_n)
represents the set of all active objects.
C_i = (C_i1, C_i2, ..., C_ik) is the set of capabilities
that are given to the object O_i. An object O_i has
to present a capability C_ij in order to access the
object O_j; the access right is a_ij. The operation
of the capability processor is as follows:

A. From the point of view of the capability
processor, each process is defined by a set of code
and data objects. The set of active objects
(stored in the physical segments of the primary
memory) can be represented by a directed graph; an
example is shown in Fig. 3. A vertex in this graph
represents an object. An edge shows the access
right of one object to another.


Figure  3 


Before a program is initiated, all access-right
information is sent to the capability processor (CP).
This is done by loading the Segment Access Table (SAT)
and the Segment Map Table (SMT) in the CP. The row S_i
(S_i is the segment ID of the object O_i) in the SAT
contains the set of access rights for the object O_i
(i.e., a_ij for j = 1..k). An entry SAT(S_i,S_j) in the
SAT shows the access right of the object O_i to the
object O_j. A null entry denotes the no-access
situation. For any code object O_i, the entry
SAT(S_i,S_i) is an execute access right. An object
O_k can be shared between two code objects O_i and O_j
with different access rights:

SAT(S_i,S_k) = a_ik and SAT(S_j,S_k) = a_jk, with a_ik ≠ a_jk.

B. For each memory reference, the physical
address is translated to a segment ID using the
SMT. This segment ID is used in turn as the
address for accessing the SAT. Two segment IDs are
required to access the SAT. The first is the
segment ID (S_i) of the current code object (O_i).
The second is the segment ID (S_j) of the object
(O_j) referenced by the current object. S_i and S_j
are determined from the physical address in each
reference and the mapping information in the SMT.
If the access requested by the CPU is not
consistent with SAT(S_i,S_j), which is read out from
the SAT, the capability processor signals the main
processor, and the main processor initiates a
recovery routine for handling the detected error.
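
As a rough illustration of steps A and B, the following Python sketch models the SMT and the SAT as dictionaries. The page size, the table contents, the representation of a SAT entry as a set of right names, and all identifiers are assumptions made for this sketch; the text does not specify an encoding.

```python
PAGE_SIZE = 256  # hypothetical page size in bytes


class CapabilityProcessor:
    """Toy model of the SMT/SAT lookup; not the hardware design itself."""

    def __init__(self, smt, sat):
        self.smt = smt  # page frame number -> segment ID
        self.sat = sat  # (S_i, S_j) -> set of access rights a_ij

    def segment_of(self, phys_addr):
        """Translate a physical address to a segment ID (None if unmapped)."""
        return self.smt.get(phys_addr // PAGE_SIZE)

    def check(self, code_addr, target_addr, requested):
        """Return True if the access is consistent with SAT(S_i, S_j)."""
        s_i = self.segment_of(code_addr)
        s_j = self.segment_of(target_addr)
        if s_i is None or s_j is None:
            return False  # unmapped page: treated as illegal
        # A missing (null) entry denotes the no-access situation.
        return requested in self.sat.get((s_i, s_j), set())


# Hypothetical tables: code segment S1 on pages 0-1, data segment S2 on page 2.
smt = {0: "S1", 1: "S1", 2: "S2"}
sat = {("S1", "S1"): {"execute"}, ("S1", "S2"): {"read", "write"}}
cp = CapabilityProcessor(smt, sat)

print(cp.check(0x010, 0x210, "write"))    # S1 holds a write capability for S2
print(cp.check(0x010, 0x210, "execute"))  # execution from a data segment
```

In this model an illegal operation such as executing from a data segment is rejected simply because "execute" is absent from the corresponding SAT entry.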


IMPLEMENTATION 

The following assumptions are made. First, all
access information given to the capability
processor by the main processor is error free. If
not, the CP may signal an access error while an
access is legal and fail to signal while the access
is illegal. Second, the probability of simultaneous
failure of the main system and the capability
processor is very low. Third, all accesses from
any location in an object Ox to any location in
another object Oy are "equivalent". In other words,
if the object Ox can write into the object Oy, this
technique does not check whether or not the
referenced location within Oy is correct.

The first stage of the capability processor is
an address translator. It translates a physical
address into a segment ID using the mapping
information, and it determines the type of the
segment (code or data). In the case of a paged
memory system, all references to different pages of
an object are mapped onto a unique segment ID for
that object using the SMT. This requires one
access to the SMT.

In Fig. 4, register Rx holds the segment ID of
the current code segment, S_i, which is determined
from the current memory reference using the mapping
data in the SMT. The segment ID for the next
reference to the memory, S_j, is also determined and
loaded into register Ry by the capability
processor. The entry SAT(S_i,S_j) is read out from
the SAT and is compared with the access requested
by the CPU. The watchdog signals an error if this
comparison fails. Notice that in this method the
capability processor checks the validity of each
access in parallel with the CPU operation. Once a
successful access is completed, the capability
processor loads Rx from Ry only if Ry holds the
segment ID of a code segment. This operation is
repeated for each memory reference.

Figure 4

Operations are divided into three classes. The
first class includes operations in which an operand
is accessed by the operation. Examples are read,
write, add, and move. In this case the instruction
is fetched from a code segment and the operand is
in a data segment. The content of Rx will not
change after such operations.

The second class includes simple control
operations such as jump and set flag. The expected
access right for such operations is "execute", and
the content of Rx will change after such operations
if and only if a transfer to another segment
occurs.

The third class includes call and return
operations. In this case the expected access right
is "enter" or "return", and again the content of Rx
will change after such operations if and only if a
transfer to another segment occurs.
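
The register bookkeeping for the three classes can be sketched as follows. This is a minimal model under stated assumptions: the segment names, the SAT contents, and the class and method names are hypothetical, and a SAT entry is again represented as a set of right names.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Toy model of the Rx/Ry bookkeeping; all names here are illustrative."""
    sat: dict       # (S_i, S_j) -> set of access rights
    seg_type: dict  # segment ID -> "code" or "data"
    rx: str         # segment ID of the current code segment

    def reference(self, s_j, requested):
        """Check one memory reference; return False to signal an error."""
        ry = s_j
        if requested not in self.sat.get((self.rx, ry), set()):
            return False              # comparison fails; Rx is left unchanged
        if self.seg_type.get(ry) == "code":
            self.rx = ry              # Rx is loaded from Ry only for code segments
        return True


wd = Watchdog(
    sat={
        ("C1", "C1"): {"execute"},
        ("C1", "D1"): {"read", "write"},
        ("C1", "C2"): {"enter"},
        ("C2", "C2"): {"execute"},
        ("C2", "C1"): {"return"},
    },
    seg_type={"C1": "code", "C2": "code", "D1": "data"},
    rx="C1",
)

results = [
    wd.reference("D1", "write"),    # class 1: operand access, Rx stays C1
    wd.reference("C1", "execute"),  # class 2: jump within C1, Rx stays C1
    wd.reference("C2", "enter"),    # class 3: call into C2, Rx becomes C2
    wd.reference("D1", "read"),     # C2 holds no capability for D1: error
]
print(results, wd.rx)
```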

Capability checking can be implemented using
memories that are as fast as the main memory. When
the main processor loads the pages of a process
into the main memory, the corresponding
capabilities are loaded into the SAT. During the
execution of a process, some of its pages may be
swapped in and out of the physical memory. For
each page replacement, only one entry in the SMT is
updated. The entries of the SMT corresponding to
the removed pages are marked as invalid, and any
access to these pages is considered illegal.

In capability checking, the accessibility of
memory segments is checked on the basis of physical
addresses at the processor-memory interface. Once
an illegal access is detected, the CPU is informed
and a recovery routine is initiated. Since the CP
operates in parallel with the CPU, it does not
degrade the system performance. Notice that
updating the SAT and the SMT can be overlapped with
the time required for swap-in and swap-out of the
pages, which is a slow operation. It is also
possible to take samples of the memory references
(at a slower rate) and use the same concept for
checking the capabilities for those samples.
However, since the checking is then not
exhaustive, some illegal accesses may go
undetected.

CONCLUSION

Low-cost watchdog processors can be designed to
detect abnormal behavior of a system under
operation. In capability checking, the
accessibility of each memory reference is checked
on the basis of physical addresses at the
processor-memory interface. Since the checking is
done in parallel with the main processor, there is
no degradation in the system performance. However,
there is a possibility that a few illegal accesses
occur before the CP signals the main processor. In
order to keep track of illegally accessed
locations, a buffer can be used to save the addresses
of the last m (e.g., m = 10) references.
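
One way to picture such a buffer is a fixed-length queue that silently drops the oldest address. The sketch below is an assumption about how it might be modelled in software (the class and method names are hypothetical); the text itself only states that the last m addresses are saved.

```python
from collections import deque


class ReferenceLog:
    """Keeps the physical addresses of the last m memory references."""

    def __init__(self, m=10):
        self.recent = deque(maxlen=m)  # the oldest entry is discarded when full

    def record(self, phys_addr):
        self.recent.append(phys_addr)

    def snapshot(self):
        """Addresses to inspect once the CP has signalled an error."""
        return list(self.recent)


log = ReferenceLog(m=3)
for addr in (0x100, 0x104, 0x200, 0x204, 0x300):
    log.record(addr)
print([hex(a) for a in log.snapshot()])
```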

Such a capability processor can be used as a
redundant protection scheme in systems where a high
degree of security is required. On the other hand,
this method by itself is an economical way of
increasing the reliability of small systems.

Current active research in this area includes
the design of watchdog processors for checking the
flow of execution or the integrity of data
structures. Preliminary results in this area are
given in [Nam 81].


ABSTRACT

In this paper, an encoder algorithm for the
design of an autonomous Linear Feedback Shift
Register (LFSR) with specified minimum distance and
cycle length is presented. The fault detectability
on the feedback path of this LFSR encoder is then
discussed. This shift register design significantly
extends the work in the literature [Hsiao 77]
[Pradhan 78], and is based on cyclic codes.


1  INTRODUCTION 

An autonomous Linear Feedback Shift Register
(LFSR) is an autonomous linear sequential network
[Elspas 59] [Kautz 65] [Zierler 59] for generating
sequences of a given cycle length (or period). This
LFSR is composed of interconnections of unit delays
(or D flip-flops) and modulo-2 adders (or Exclusive
OR gates), as shown in Figure 1.

The LFSR has been used in many different
applications. Examples of these applications are
pseudo-random number generators [Golomb 67],
signature analyzers [Benowitz 75] [McCluskey 81],
shift register counters [Gschwind 75], store
address generators [Hsiao 77] [Pradhan 78], etc.
For instance, in LSI/VLSI chip designs using a
random testing scheme [Losq 76], the random input
sequence fed into the Device Under Test (DUT)
can be autonomously generated by an LFSR of minimum
(Hamming) distance 1, or by an LFSR encoder of
minimum distance at least 2 [Hsiao 77].

An LFSR of minimum distance 1 (or distance-1
LFSR) is an autonomous LFSR with minimum Hamming
distance 1 among the generated states. It cannot
detect any fault inside itself. If a fault (or
error) occurs that causes a faulty input sequence
to be applied to the DUT, then even if the DUT is
good, the output response will be incorrect, which
may result in the DUT being rejected. This fault
may be detectable on-line if an LFSR encoder of
minimum distance at least 2 followed by an error
detector is adopted. Whether it is detected
depends on how large a minimum distance is used.


2  LFSR  PROPERTIES 


Figure 1 shows a general form of an n-stage
LFSR with the corresponding characteristic
polynomial defined by

f(x) = 1 + h_1 x + h_2 x^2 + ... + h_{n-1} x^{n-1} + x^n,   (1)

where h_i (1 ≤ i ≤ n-1) is either one or zero.


Figure 1. The general form of an LFSR.


The behavior of an LFSR can be interpreted as
an ordered cyclic chain of states S_i, which are
symbolic representations of the contents of an LFSR
during successive shifts, given the initial
contents S_0. Let S_i represent the contents of
an n-stage LFSR after the i-th shift of the initial
contents, S_0, of the LFSR, and let S_i(x) be the
polynomial representation of S_i. Then S_i(x) is a
polynomial of degree n-1,

S_i(x) = S_{i0} + S_{i1} x + ... + S_{i,n-1} x^{n-1}.   (2)

The following is a fundamental relationship between
the states in a cycle [Hsiao 77]:

S_i(x) = x^{i-j} S_j(x) mod f(x).   (3)

If T is the least positive integer such that f(x)
divides x^T - 1, then for any state S_j(x),

S_j(x) = x^T S_j(x) mod f(x).   (4)

The integer T is called the exponent of f(x) and
the period of the LFSR.
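
The exponent T can be computed directly from its definition. The sketch below assumes q = 2, stores a polynomial as a Python integer whose bit i is the coefficient of x^i, and models one shift as multiplication by x modulo f(x); the function names are introduced here for illustration only.

```python
def poly_mod(a: int, f: int) -> int:
    """Remainder of a(x) divided by f(x) over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def shift(state: int, f: int) -> int:
    """One LFSR shift, modelled as multiplication by x modulo f(x)."""
    return poly_mod(state << 1, f)


def exponent(f: int) -> int:
    """Least T > 0 such that f(x) divides x^T - 1 (requires f(0) = 1)."""
    state, t = shift(1, f), 1
    while state != 1:
        state, t = shift(state, f), t + 1
    return t


print(exponent(0b1011))  # f(x) = 1 + x + x^3
print(exponent(0b1111))  # f(x) = 1 + x + x^2 + x^3 = (1 + x)(1 + x^2)

# Shifting any state T times returns it unchanged.
s = 0b101
for _ in range(exponent(0b1011)):
    s = shift(s, 0b1011)
print(s == 0b101)
```

The first polynomial is primitive of degree 3, so its exponent is 2^3 - 1 = 7; the second divides x^4 - 1 and has exponent 4.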

3 ENCODER DESIGN

Let a polynomial with coefficients in the
Galois Field GF(q) [Peterson 72] be said to be a
polynomial over GF(q). A polynomial p(x) of degree
m over GF(q) is called primitive if its root b of
GF(q^m) with p(b) = 0 generates all the nonzero
elements of GF(q^m). A polynomial g(x) of degree n-k
over GF(q) for generating an (n,k) cyclic code is
called a generator polynomial if it is unique and
is a divisor of x^n - 1. An (n,k) cyclic code is an
(n,k) linear code containing a set of q^k distinct
n-tuple code words with the following property: if
an n-tuple is a code word, then the n-tuple
obtained by rotating the code word one place to the
right is also a code word [Lin 70].

Let p(x) and g(x) denote a primitive
polynomial of degree k and a generator polynomial
of degree n-k over Galois Field GF(q),
respectively. [Hsiao 77] has shown that, given a
required k message digits and a desired design
(minimum) distance d_min, an autonomous LFSR can be
constructed to generate q^k - 1 distinct n-tuple code
words by using the characteristic polynomial
f(x) = g(x)p(x). The initial contents, S_0(x), of the
LFSR can be preset to any nonzero code word. This
theorem is based on an (n,k) cyclic code with g(x)
dividing x^n - 1, and is applicable to an LFSR design
with period T = q^{k-s} - 1 for 0 < s < k by deleting
s digits from the k message digits. The result is an
(n-s,k-s) shortened cyclic code. However, it is not
applicable to an LFSR generating an arbitrary
period. With period T not equal to q^{k-s} - 1, the
initial contents of the LFSR have to be chosen very
carefully to avoid producing an incorrect period.

Definition 1: An (n,k,T) LFSR encoder is an
n-stage autonomous LFSR for generating T distinct
n-tuple code words (or states) by T
consecutive shifts of the LFSR. It consists of k
message digits and n-k parity check digits.

Theorem 1: An (n,k,T) LFSR encoder can be
constructed using f(x) = g(x)p(x) as a characteristic
polynomial over GF(q), if the initial contents,
S_0(x), of the LFSR is divisible by g(x), i.e.,
S_0(x) = g(x)a(x), and a(x) and p(x) have no
common factor, where g(x) is a generator polynomial
of degree r = n-k for generating an (n,k) cyclic
code, p(x) is a polynomial of the smallest degree k
for generating a prescribed period T, and a(x) is a
polynomial of degree k-1 or less.

Proof: See [Wang 82].

Theorem 1 is applicable to an (n-s,k-s,T')
LFSR encoder design by deleting s digits from the k
message digits for 0 < s < k. It implies that every
code word S_i(x) of an LFSR encoder is divisible by
g(x), and the Greatest Common Divisor (GCD) of
S_0(x) and f(x) is g(x), i.e., GCD(S_0(x), f(x)) =
g(x). If the GCD of S_0(x) and f(x) is not g(x),
the LFSR will produce an incorrect period and may
result in a different minimum distance.


Example 1: Design a (6,3,4) LFSR encoder using

f(x) = g(x)p(x) = (1+x+x^3)[(1+x)(1+x^2)] = 1 + x^3 + x^5 + x^6.

The circuit is shown in Figure 2. The desired
S_0(x)'s for the (6,3,4) LFSR encoder are marked by
(*). Each such S_0(x) and f(x) have GCD g(x) = 1+x+x^3.

Table 1. The S_0(x)'s with resulting T and d_min.

S_0(x)                 T    d_min
1+x+x^3                4    4 (*)
(1+x+x^3)(1+x)         2    4
(1+x+x^3)(1+x^2)       1    0
(1+x+x^3)(1+x+x^2)     4    4 (*)


Figure 2. A (6,3,4) LFSR encoder.
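
The dependence of T and d_min on the initial contents in Example 1 can be checked numerically. The sketch below is an illustration under stated assumptions: q = 2, a polynomial is stored as an integer whose bit i is the coefficient of x^i, one shift is modelled as multiplication by x modulo f(x), and d_min is taken as 0 when the cycle contains a single state. All function names are introduced here.

```python
from itertools import combinations


def poly_mul(a, b):
    """Product over GF(2); bit i of an int is the coefficient of x^i."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(a, f):
    """Remainder of a(x) divided by f(x) over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def cycle(s0, f):
    """States visited from s0 until the LFSR returns to s0."""
    states, s = [s0], poly_mod(s0 << 1, f)
    while s != s0:
        states.append(s)
        s = poly_mod(s << 1, f)
    return states


def min_distance(states):
    """Minimum pairwise Hamming distance (0 when there is a single state)."""
    if len(states) < 2:
        return 0
    return min(bin(u ^ v).count("1") for u, v in combinations(states, 2))


G = 0b1011          # g(x) = 1 + x + x^3
P = 0b1111          # p(x) = (1 + x)(1 + x^2)
F = poly_mul(G, P)  # f(x) = g(x) p(x)

table = {}
for label, a in (("1", 0b1), ("1+x", 0b11), ("1+x^2", 0b101), ("1+x+x^2", 0b111)):
    states = cycle(poly_mul(G, a), F)
    table[a] = (len(states), min_distance(states))
    print(f"a(x) = {label:8s} T = {len(states)}  d_min = {min_distance(states)}")
```

Only the choices of a(x) that share no factor with p(x), namely 1 and 1+x+x^2, give the prescribed period 4, which is the condition stated in Theorem 1.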


The synthesis of a polynomial p(x) of the
smallest degree k for a prescribed period T was
presented in [Wang 82]. Theorem 2 provides an
encoder algorithm to derive the required g(x) based
on Bose-Chaudhuri-Hocquenghem (BCH) codes [Peterson
72]. BCH codes are cyclic codes. Let b be an
element of GF(q^m). For any specified integer c and
design distance d, the code generated by g(x) is a
BCH code if and only if g(x) is the polynomial of
the smallest degree over GF(q) for which
b^c, b^{c+1}, ..., b^{c+d-2} are roots.


Theorem 2: A (2^m - 1,k,T) LFSR encoder with
design distance, d_min, at least 2t+1, where t is
an integer, can be constructed using f(x) = g(x)p(x)
as a characteristic polynomial, if the polynomial
p(x) is of the smallest degree k for generating a
prescribed period T, and the generator polynomial
g(x) of the code is given by

g(x) = LCM{m_1(x), m_3(x), ..., m_{2t-1}(x)},   (5)

the Least Common Multiple (LCM) of the m_i(x)'s of
degree m (i = 1, 3, 5, ..., 2t-1), where m_1(x) is a
primitive polynomial of degree m, its root b over
GF(2^m) is of order 2^m - 1, and m_i(x) (i = 3, 5, ..., 2t-1)
is the minimum polynomial of b^i.

Proof: See [Wang 82].

A minimum polynomial m_i(x) of root b^i over
GF(q^m) is a polynomial of the smallest degree over
GF(q) such that m_i(b^i) = 0. Table C.2 [Peterson 72]
provides a list of minimum polynomials of root b^i
of degree 31 or less. Since a primitive polynomial
is a minimum polynomial of root b, g(x) can thus be
found from Table C.2 [Peterson 72]. Table 2 lists
the required g(x)'s of degree r for designing some
(2^m - 1,k,T) LFSR encoders with d_min = 3, 5, or 7.
The number k of message digits was obtained by
setting it to 2^m - 1 - r.


Table 2. The g(x)'s for (2^m - 1,k,T) LFSR encoders
of d_min = 3, 5, or 7.

(2^m-1,k,T)    d_min    g(x)
(7,4,5)        3        1+x+x^3
(15,11,23)     3        1+x+x^4
(15,7,127)     5        (1+x+x^4)(1+x+x^2+x^3+x^4)
(15,5,31)      7        (1+x+x^4)(1+x+x^2+x^3+x^4)(1+x+x^2)
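
The generator polynomials of equation (5) can also be computed rather than looked up. The sketch below does this for m = 4 under stated assumptions: q = 2, GF(2^4) is defined by the primitive polynomial 1+x+x^4, b is the element x, and a polynomial is stored as an integer whose bit d is the coefficient of x^d. The function names are introduced here and are not taken from [Peterson 72].

```python
M = 4
PRIM = 0b10011  # 1 + x + x^4, assumed primitive polynomial defining GF(2^4)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M), each stored as an int bitmask."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= PRIM
    return result


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def minimal_poly(i):
    """Minimum polynomial of b^i over GF(2), with b = x."""
    order = (1 << M) - 1
    exps, e = [], i % order
    while e not in exps:          # conjugates b^i, b^(2i), b^(4i), ...
        exps.append(e)
        e = (2 * e) % order
    coeffs = [1]                  # coefficients in GF(2^M), lowest degree first
    for e in exps:
        root = gf_pow(0b10, e)
        nxt = [0] * (len(coeffs) + 1)
        for d, c in enumerate(coeffs):
            nxt[d] ^= gf_mul(c, root)   # contribution of c * root
            nxt[d + 1] ^= c             # contribution of c * y
        coeffs = nxt
    assert all(c in (0, 1) for c in coeffs)  # result lies in GF(2)[y]
    return sum(c << d for d, c in enumerate(coeffs))


def poly_mul(a, b):
    """Product of two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def generator(t):
    """g(x) = LCM{m_1, m_3, ..., m_(2t-1)}; distinct minimum polynomials are coprime."""
    g = 1
    for m in {minimal_poly(i) for i in range(1, 2 * t, 2)}:
        g = poly_mul(g, m)
    return g


for t in (1, 2, 3):
    g = generator(t)
    r = g.bit_length() - 1
    print(f"t = {t}: deg g = {r}, k = {15 - r}, g = {g:b}")
```

For t = 1, 2, 3 the degrees r = 4, 8, 10 give k = 15 - r = 11, 7, 5, matching the n = 15 rows of Table 2.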


The above theorem is applicable to (n',k',T')
= (2^m - 1 - s,k-s,T') LFSR encoders [Hsiao 77] [Lin 70]
[Peterson 72] for 0 < s < 2^{m-1}. Since f(x) always
contains a factor 1+x when T is even, by the
property [Hsiao 77] [Pradhan 78] [Peterson 72] that
the minimum distance of the generated code space is
even if 1+x is a divisor of f(x), the resulting
design distance of the LFSR encoder will be an even
number 2t+2 if the period T is even, and an odd
number 2t+1 if T is odd. In implementing an LFSR
encoder with desired even minimum distance of at
least 2t+2 for a prescribed odd period T, the
generator polynomial should be modified as

g(x) = (1+x) LCM{m_1(x), m_3(x), ..., m_{2t-1}(x)}.   (6)
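
The parity property cited above follows from evaluating polynomials at x = 1. The short derivation below is a standard argument added for convenience, assuming arithmetic over GF(2) and the state model S_i(x) = x^i S_0(x) mod f(x); the quotient h(x) is introduced only for this sketch.

```latex
% Assumptions: coefficients in GF(2), (1+x) divides f(x), so f(1) = 0.
\begin{align*}
S_i(x) + S_j(x) &= (x^i + x^j)\,S_0(x) + f(x)\,h(x) && \text{for some } h(x),\\
S_i(1) + S_j(1) &= (1 + 1)\,S_0(1) + f(1)\,h(1) = 0 && \text{since } f(1) = 0 .
\end{align*}
% The number of nonzero coefficients of S_i(x) + S_j(x) is therefore even,
% so the Hamming distance between any two generated states is even.
```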

Example 2: The following examples were derived from
Table 2 by deleting some message digits. For
instance, the (6,3,4) encoder was derived from the
(7,4,5) encoder of d_min = 3 by deleting one message
digit. Since the period 4 is an even number, the
(6,3,4) LFSR encoder will have a design distance of
3 + 1 = 4.

Table 3. (2^m - 1 - s,k-s,T') LFSR encoders.

(n',k',T')    d_min    p(x)           g(x)
(6,3,4)       4        1+x+x^2+x^3    1+x+x^3
(11,3,4)      6        1+x+x^2+x^3    (1+x+x^4)(1+x+x^2+x^3+x^4)
(7,3,7)       4        1+x+x^3        (1+x)(1+x+x^3)
(12,3,7)      6        1+x+x^3        (1+x)(1+x+x^4)(1+x+x^2+x^3+x^4)


4 FAULT DETECTABILITY

An encoder algorithm for the LFSR design with
desired period T and minimum distance d_min has been
presented. This LFSR encoder gives a Totally
Self-Checking (TSC) error detector the on-line
fault-detection capability to detect at most d_min - 1
errors on the encoder output [Wang 82]. The fault
model can be any combination of faults, such as
stuck-at faults, bridging faults, and external
disturbances (or noise), which manifest themselves
by changing at most d_min - 1 positions on the
output of the LFSR encoder. However, it is not
clear whether the fault that makes the entire
feedback path assume an erroneous state is
detectable or not. For instance, an error on the
feedback path of Figure 2 may produce 3 errors in
the state.

Definition 2: An (n,k,T) = (2^m - 1,k,2^k - 1) cyclic
code is an (n,k) cyclic code of period 2^k - 1. An
(n',k',2^{k'} - 1) = (2^m - 1 - s,k-s,2^{k-s} - 1) shortened
cyclic code is an (n',k') shortened cyclic code of
period 2^{k'} - 1, obtained by deleting s digits from
the k message digits for 0 < s < 2^{m-1}.

[Hsiao 77] has shown that in the (n,k,2^k - 1) =
(2^m - 1,k,2^k - 1) cyclic codes, the effect of a fault
on the autonomous LFSR feedback path is only to
produce a noncode word which differs in the first
stage of the LFSR from some valid code word. In
implementing an LFSR encoder by (n',k',2^{k'} - 1) =
(2^m - 1 - s,k-s,2^{k-s} - 1) shortened cyclic codes for
0 < s < 2^{m-1}, the effect of a fault on the LFSR
feedback path is proven to produce errors exactly
at distance d_min - 1 from some valid code word.

Theorem 3: In an (n',k',2^{k'} - 1) LFSR encoder
design implementing a shortened cyclic code, the
effect of a fault that makes the entire feedback
path assume an erroneous value is to change the
correct state (a code word) to an erroneous state
(noncode word) which is exactly at distance d_min - 1
from some other valid state.

Proof: See [Wang 82].

Theorem 3 extends the fault detectability in
[Hsiao 77] to (n',k',2^{k'} - 1) shortened cyclic codes.
For a code of period T' not equal to 2^{k'} - 1, the
fault, making the entire feedback path assume an
erroneous state in realizing an LFSR encoder, may
certainly cause a noncode word at distance more
than 1 for n' = 2^m - 1 (or at distance more than
d_min - 1 for n' not equal to 2^m - 1) from the T' states.
Fortunately, since the same g(x) can be used to
realize both (n',k',T') and (n',k',2^{k'} - 1) LFSR
encoders, and the same error detector can be
adopted to detect errors within themselves, the
fault detectability will be the same when both
criteria are applied. Moreover, since the
generator polynomial g(x) in an LFSR encoder of an
even period with d_min = 2t+2 is one degree less than
that in an LFSR encoder of an odd period with
d_min = 2t+1 for t-error-correcting, the noncode
word produced by a fault on the feedback path
will always be at distance 2t, irrespective of the
period being even or odd. This is summarized below:

Corollary 1: Suppose that the same generator
polynomial g(x) is used to implement both
(n',k',T') and (n',k',2^{k'} - 1) LFSR encoders for T'
not equal to 2^{k'} - 1, and both LFSR encoders use the
same error detector to detect errors within
themselves. Then the effect of the fault which
makes the entire feedback path assume an erroneous
state in realizing an LFSR encoder of period T'
is to produce a noncode word which (1) differs in
the first stage of the LFSR encoder from some code
word in the (n,k,2^k - 1) cyclic codes, and (2) is
exactly at distance d_min - 1 (or d_min - 2) from some
code word, for T' odd (or even), in (n',k',2^{k'} - 1)
shortened cyclic codes, respectively.

Example 3: A (6,3,4) distance-4 LFSR encoder can
use the same error detector as a (6,3,7) distance-3
LFSR encoder. Figure 3 shows the (6,3,7) distance-3
LFSR encoder using

f(x) = g(x)p(x) = (1+x+x^3)(1+x^2+x^3) = 1+x+x^2+x^3+x^4+x^5+x^6.

The code space consisting of
the 7 nonzero 6-tuple code words is given in Table
4. Marked by (*) are the states generated by a
(6,3,4) distance-4 LFSR encoder as shown in Figure
2. If the feedback path on both (6,3,7) and
(6,3,4) LFSR encoders were stuck at one, then the
next noncode words for the same input code word
<110100>, for instance, would be <100101> and
<111111>, respectively. Both invalid code words are
exactly at (minimum) distance 2 from the code words
<100011> and <101110>, respectively, although they
produce up to 6 and 3 errors on the next states,
respectively. Thus, the fault can be immediately
detected by the error detector.
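
The numbers in Example 3 can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions: bit i of a state is the coefficient of x^i (printed with the x^0 stage leftmost, as in the code words above), the feedback line normally carries the last stage, and a stuck-at-one fault forces that line to 1 for one shift. The function names are introduced here.

```python
def next_state(state, f, stuck_at_one=False):
    """One shift of an n-stage LFSR with characteristic polynomial f(x)."""
    n = f.bit_length() - 1
    mask = (1 << n) - 1
    feedback = 1 if stuck_at_one else (state >> (n - 1)) & 1
    shifted = (state << 1) & mask
    return shifted ^ (f & mask) if feedback else shifted


def bits(state, n):
    """Render a state with the x^0 stage leftmost."""
    return "".join(str((state >> i) & 1) for i in range(n))


def distance(u, v):
    """Hamming distance between two states."""
    return bin(u ^ v).count("1")


F7 = 0b1111111  # (1+x+x^3)(1+x^2+x^3), the (6,3,7) encoder
F4 = 0b1101001  # (1+x+x^3)(1+x)(1+x^2), the (6,3,4) encoder
S = 0b001011    # code word <110100>, i.e. 1 + x + x^3

# Fault-free code space of the (6,3,7) encoder.
code_words, s = [], S
for _ in range(7):
    code_words.append(s)
    s = next_state(s, F7)

good = next_state(S, F7)  # the fault-free successor is the same for both encoders
for name, f in (("(6,3,7)", F7), ("(6,3,4)", F4)):
    bad = next_state(S, f, stuck_at_one=True)
    nearest = min(distance(bad, c) for c in code_words)
    print(name, bits(bad, 6), "errors:", distance(bad, good), "nearest code word:", nearest)
```

Under these assumptions the faulty successors are <100101> and <111111>, with 6 and 3 erroneous positions and a distance of 2 to the nearest code word in both cases, in agreement with the example.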

Table 4. Code space of a (6,3,7) distance-3 LFSR
encoder.

1 1 0 1 0 0  (*)
0 1 1 0 1 0  (*)
0 0 1 1 0 1  (*)
1 1 1 0 0 1
1 0 0 0 1 1  (*)
1 0 1 1 1 0
0 1 0 1 1 1


Figure 3. A (6,3,7) distance-3 LFSR encoder.


5  CONCLUSIONS 

In this paper, an encoder algorithm is
presented to derive the required characteristic
polynomial and the initial contents for an LFSR
encoder design with specified minimum distance and
cycle length. It shows that in designing an LFSR
encoder with cycle length not equal to 2^k - 1, for k
an integer, the initial contents have to be chosen
under certain constraints to avoid a
shortened cycle length. Any LFSR encoder with even
cycle length always produces an even design
(minimum) distance. The fault detectability study
on the feedback path of the LFSR encoder
implementing shortened cyclic codes indicates that
a fault that makes the entire feedback path assume
an erroneous state produces a noncode word
exactly at distance d_min - 1 from some code word. Any
combination of faults, such as stuck-at
faults, bridging faults, and external disturbances
(or noise) within the circuit, that manifests
itself by changing at most d_min - 1 positions on
the LFSR encoder output is detectable.